From steffen_moeller at gmx.de Mon Jul 2 10:23:43 2012
From: steffen_moeller at gmx.de (Steffen Möller)
Date: Mon, 02 Jul 2012 18:23:43 +0200
Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option
In-Reply-To: <4FCCEBBA.4070205@ldeo.columbia.edu>
References: <4FCCD8CE.6040309@ldeo.columbia.edu> <4FCCEBBA.4070205@ldeo.columbia.edu>
Message-ID: <4FF1CB0F.5060503@gmx.de>

Hello,

I just read this by chance. I do not see that there would be
anything speaking against a direct support of Torque in
OpenMPI. @Manuel from Debian-OpenMPI: is respective
support still live?

Cheers,

Steffen

On 06/04/2012 07:09 PM, Gus Correa wrote:
> Hi Damian
>
> I am not a Debian user,
> but I would guess it is unlikely that
> the Debian packages will add Torque support.
> This is because OpenMPI can support SGE, Slurm, and other
> resource managers, and you probably cannot add support for
> all of them at the same time.
> Which resource manager one chooses is a matter of taste.
>
> It may be easier to just uninstall the Debian OpenMPI packages
> and install OpenMPI from source, with Torque support,
> in a non-system directory [/usr/local/openmpi,
> /opt/openmpi-X.Y.Z, ...]
>
> Something like this:
>
> ./configure --prefix=/a/non-system/directory
> --with-tm=/your/torque/directory ...
> make
> make install
>
> Uninstalling the existent OpenMPI packages will save
> you headaches with inconsistent/duplicate/mixed binaries,
> libraries, paths, etc.
>
> Then add the new OpenMPI directories to the
> appropriate environment variables
> [PATH and LD_LIBRARY_PATH] the way you prefer [say, via
> .bashrc, .tcshrc]
>
> I hope this helps,
> Gus Correa
>
> On 06/04/2012 12:44 PM, Damian Montaldo wrote:
>> On Mon, Jun 4, 2012 at 12:58 PM, Damian Montaldo
>> wrote:
>>> On Mon, Jun 4, 2012 at 12:48 PM, Gus Correa wrote:
>>>> Hi Damian
>>>>
>>>> Did you build your OpenMPI with Torque support?
>>>> Or did you install it from a Debian package?
>>> Hi Gus, I just installed the Debian packages of torque and openmpi (both
>>> from the Debian repository)
>>>
>>>> The Debian OpenMPI package [if it exists] may not
>>>> have Torque support.
>>>> In this case, the mpiexec/mpirun probably won't
>>>> know how to coordinate with Torque regarding
>>>> nodes, cores, resources, etc.
>>>>
>>>> You can do
>>>> 'mpicc --showme'
>>>> to see if
>>>> '-ltorque'
>>>> appears there.
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>> $ mpicc --showme
>>> gcc -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi
>>> -pthread -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lopen-pal -ldl
>>> -Wl,--export-dynamic -lnsl -lutil -lm -ldl
>>>
>>> You're right, thanks a lot!
>>>
>>> I'll try to build it from source instead of using the package from Debian.
>>> If I can solve it I'll post it here.
>> I was going to report this bug, but it had already been reported here:
>>
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=592887
>> Build with support for Torque (except on HURD).
>>
>> I've installed that version (1.4.2-4) but it seems to lack Torque support.
>> I'll continue this issue on the Debian-related lists or on the OpenMPI list.
>>
>> Thanks.
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers

From sm4082 at nyu.edu Mon Jul 2 10:39:47 2012
From: sm4082 at nyu.edu (Sreedhar M)
Date: Mon, 2 Jul 2012 12:39:47 -0400
Subject: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
In-Reply-To: <1341105634.93549.YahooMailClassic@web92203.mail.cnh.yahoo.com>
References: <1341105634.93549.YahooMailClassic@web92203.mail.cnh.yahoo.com>
Message-ID: <3FC2336B-8D90-42CE-B4D8-DC487682C6F8@nyu.edu>

You need to install BLCR on all the compute nodes. I am assuming you're trying to run jobs on one of the compute nodes and that the node that has torque is not one of them. Check the BLCR website for more information on adding the path of the blcr libraries to LD_LIBRARY_PATH.

Sreedhar Manchu
HPC Support Specialist
ITS-Esystems/Research Services
New York University, NY - 100012

On Jun 30, 2012, at 9:20 PM, TingtingYang wrote:

> Thank you for your reply.
> I only installed blcr and torque on one node. Is that OK?
>
> tingting.yang
>
> --- On 2012-06-30, Sreedhar Manchu wrote:
>
> From: Sreedhar Manchu
> Subject: Re: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
> To: "Torque Users Mailing List"
> Date: 2012-06-30, 12:02
>
> Just making sure:
>
> You have installed BLCR on all the nodes. Right?
>
> Sreedhar.
>
> On Jun 28, 2012, at 6:09 AM, TingtingYang wrote:
>
>> Hi all:
>> I encountered an error when I tried to use torque/blcr to checkpoint my job.
>> I submit a job with qsub -c enabled ./crtest and hold it with qhold job_ID
>> The job error file said:
>> ERROR: ld.so: object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
>>
>> and /var/log/messages said:
>>
>> Jun 28 17:32:09 node8 checkpoint_script: Invoked: /var/spool/torque/mom_priv/blcr_checkpoint_script 24946 55.node8 pbs pbs /var/spool/torque/checkpoint/55.node8.CK ckpt.55.node8.1340875929 15 -
>> Jun 28 17:32:09 node8 kernel: blcr: Retry request on -CR_ENOSUPPORT
>> Jun 28 17:32:09 node8 checkpoint_script: Subcommand (cr_checkpoint --signal 15 --tree 24946 --file ckpt.55.node8.1340875929) failed with rc=52: - Retry request on -CR_ENOSUPPORT Checkpoint failed: support missing from application
>> Jun 28 17:32:09 node8 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 52
>>
>> It sounds as if I did not export the BLCR library in my LD_LIBRARY_PATH, but I did set up my user environment and I can run cr_run, cr_checkpoint and cr_restart on the command line.
>> I added $ENV{LD_LIBRARY_PATH} = "blcr_libpath" ; to checkpoint_script and the error still exists.
>> I use blcr-0.8.4 and torque-2.4.16.
>> Can someone help?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120702/b1fa2676/attachment.html

From damianmontaldo at gmail.com Mon Jul 2 16:09:29 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Mon, 2 Jul 2012 19:09:29 -0300
Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option
In-Reply-To: <4FF1CB0F.5060503@gmx.de>
References: <4FCCD8CE.6040309@ldeo.columbia.edu> <4FCCEBBA.4070205@ldeo.columbia.edu> <4FF1CB0F.5060503@gmx.de>
Message-ID:

Hi, I wrote him twice and I didn't receive any answer.

On Mon, Jul 2, 2012 at 1:23 PM, Steffen Möller wrote:
> Hello,
>
> I just read this by chance. I do not see that there would be
> anything speaking against a direct support of Torque in
> OpenMPI. @Manuel from Debian-OpenMPI: is respective
> support still live?
>
> Cheers,
>
> Steffen
>
> On 06/04/2012 07:09 PM, Gus Correa wrote:
>> Hi Damian
>>
>> I am not a Debian user,
>> but I would guess it is unlikely that
>> the Debian packages will add Torque support.
>> This is because OpenMPI can support SGE, Slurm, and other
>> resource managers, and you probably cannot add support for
>> all of them at the same time.
>> Which resource manager one chooses is a matter of taste.
>>
>> It may be easier to just uninstall the Debian OpenMPI packages
>> and install OpenMPI from source, with Torque support,
>> in a non-system directory [/usr/local/openmpi,
>> /opt/openmpi-X.Y.Z, ...]
>>
>> Something like this:
>>
>> ./configure --prefix=/a/non-system/directory
>> --with-tm=/your/torque/directory ...
>> make
>> make install
>>
>> Uninstalling the existent OpenMPI packages will save
>> you headaches with inconsistent/duplicate/mixed binaries,
>> libraries, paths, etc.
>>
>> Then add the new OpenMPI directories to the
>> appropriate environment variables
>> [PATH and LD_LIBRARY_PATH] the way you prefer [say, via
>> .bashrc, .tcshrc]
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 06/04/2012 12:44 PM, Damian Montaldo wrote:
>>> On Mon, Jun 4, 2012 at 12:58 PM, Damian Montaldo
>>> wrote:
>>>> On Mon, Jun 4, 2012 at 12:48 PM, Gus Correa wrote:
>>>>> Hi Damian
>>>>>
>>>>> Did you build your OpenMPI with Torque support?
>>>>> Or did you install it from a Debian package?
>>>> Hi Gus, I just installed the Debian packages of torque and openmpi (both
>>>> from the Debian repository)
>>>>
>>>>> The Debian OpenMPI package [if it exists] may not
>>>>> have Torque support.
>>>>> In this case, the mpiexec/mpirun probably won't
>>>>> know how to coordinate with Torque regarding
>>>>> nodes, cores, resources, etc.
>>>>>
>>>>> You can do
>>>>> 'mpicc --showme'
>>>>> to see if
>>>>> '-ltorque'
>>>>> appears there.
>>>>>
>>>>> I hope this helps,
>>>>> Gus Correa
>>>> $ mpicc --showme
>>>> gcc -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi
>>>> -pthread -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lopen-pal -ldl
>>>> -Wl,--export-dynamic -lnsl -lutil -lm -ldl
>>>>
>>>> You're right, thanks a lot!
>>>>
>>>> I'll try to build it from source instead of using the package from Debian.
>>>> If I can solve it I'll post it here.
>>> I was going to report this bug, but it had already been reported here:
>>>
>>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=592887
>>> Build with support for Torque (except on HURD).
>>>
>>> I've installed that version (1.4.2-4) but it seems to lack Torque support.
>>> I'll continue this issue on the Debian-related lists or on the OpenMPI list.
>>>
>>> Thanks.

From jscoggins at lbl.gov Tue Jul 3 14:55:23 2012
From: jscoggins at lbl.gov (Jacqueline Scoggins)
Date: Tue, 3 Jul 2012 13:55:23 -0700
Subject: [torqueusers] Missing resource_used info in /var/spool/torque/server_priv/accounting - torque 4.0.x
Message-ID:

I am running the new torque version and when my job finished I am not seeing the resource_used in the accounting file. Am I missing something in my configuration?

Torque - installed:
torque-devel-4.1.0-1.cri.x86_64
torque-debuginfo-4.1.0-1.cri.x86_64
torque-client-4.1.0-1.cri.x86_64
torque-drmaa-4.1.0-1.cri.x86_64
torque-server-4.1.0-1.cri.x86_64
torque-4.1.0-1.cri.x86_64
torque-scheduler-4.1.0-1.cri.x86_64

grep 1366 20120703
07/03/2012 13:32:12;Q;1366.phoenix.scs.lbl.gov;queue=lr_batch
07/03/2012 13:32:12;Q;1366.phoenix.scs.lbl.gov;queue=lr_short
07/03/2012 13:32:49;S;1366.phoenix.scs.lbl.gov;user=scoggins group=scs jobname=STDIN queue=lr_short ctime=1341347532 qtime=1341347532 etime=1341347532 start=1341347569 owner=scoggins at phoenix.scs.lbl.gov exec_host=n0000.phoenix/0 Resource_List.neednodes=1:ppn=1:lr1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1:lr1 Resource_List.walltime=00:30:00
07/03/2012 13:36:48;E;1366.phoenix.scs.lbl.gov;user=scoggins group=scs jobname=STDIN queue=lr_short ctime=1341347532 qtime=1341347532 etime=1341347532 start=1341347569 owner=scoggins at phoenix.scs.lbl.gov exec_host=n0000.phoenix/0 Resource_List.neednodes=1:ppn=1:lr1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1:lr1 Resource_List.walltime=00:30:00 session=2873 end=1341347808 Exit_status=265

This is from this command:

qsub -I -q lr_batch -l nodes=1:ppn=1:lr1

The exit status of 265 is confusing, because all I am doing from a standard shell is typing the word "exit", and my job completes with an exit code of 265.

Here are my server configuration settings:

set server scheduling = True
set server acl_hosts = phoenix.scs
set server acl_hosts += localhost
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server job_stat_rate = 45
set server poll_jobs = True
set server log_level = 1
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = phoenix.scs
set server allow_node_submit = True
set server next_job_number = 1366
set server moab_array_compatible = True

Thanks

Jackie

From ytt515 at yahoo.cn Tue Jul 3 19:47:48 2012
From: ytt515 at yahoo.cn (TingtingYang)
Date: Wed, 4 Jul 2012 09:47:48 +0800 (CST)
Subject: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
In-Reply-To: <3FC2336B-8D90-42CE-B4D8-DC487682C6F8@nyu.edu>
Message-ID: <1341366468.69042.YahooMailClassic@web92202.mail.cnh.yahoo.com>

Thanks for your reply. Right now I just want to test checkpointing with blcr+torque+openmpi, so I installed torque+blcr on one node, put the torque and blcr bin/lib directories in my PATH, and followed your document on this subject:
http://www.clusterresources.com/torquedocs21/2.6jobcheckpoint.shtml
I start pbs_server, pbs_sched and pbs_mom all on this node, and I can run cr_checkpoint on the command line as a non-root user.

I found that someone has the same problem:
http://www.mailinglistarchive.com/html/torqueusers at supercluster.org/2012-04/msg00159.html
Was his problem solved? I cannot get his email address.

tingting.yang

--- On 2012-07-03, Sreedhar M wrote:

From: Sreedhar M
Subject: Re: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
To: "Torque Users Mailing List"
Date: 2012-07-03, 12:39

You need to install BLCR on all the compute nodes. I am assuming you're trying to run jobs on one of the compute nodes and that the node that has torque is not one of them. Check the BLCR website for more information on adding the path of the blcr libraries to LD_LIBRARY_PATH.

Sreedhar Manchu
HPC Support Specialist
ITS-Esystems/Research Services
New York University, NY - 100012

On Jun 30, 2012, at 9:20 PM, TingtingYang wrote:

Thank you for your reply. I only installed blcr and torque on one node. Is that OK?

tingting.yang

--- On 2012-06-30, Sreedhar Manchu wrote:

From: Sreedhar Manchu
Subject: Re: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
To: "Torque Users Mailing List"
Date: 2012-06-30, 12:02

Just making sure:

You have installed BLCR on all the nodes. Right?

Sreedhar.

On Jun 28, 2012, at 6:09 AM, TingtingYang wrote:

Hi all:
I encountered an error when I tried to use torque/blcr to checkpoint my job.
I submit a job with qsub -c enabled ./crtest and hold it with qhold job_ID
The job error file said:
ERROR: ld.so: object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.

and /var/log/messages said:

Jun 28 17:32:09 node8 checkpoint_script: Invoked: /var/spool/torque/mom_priv/blcr_checkpoint_script 24946 55.node8 pbs pbs /var/spool/torque/checkpoint/55.node8.CK ckpt.55.node8.1340875929 15 -
Jun 28 17:32:09 node8 kernel: blcr: Retry request on -CR_ENOSUPPORT
Jun 28 17:32:09 node8 checkpoint_script: Subcommand (cr_checkpoint --signal 15 --tree 24946 --file ckpt.55.node8.1340875929) failed with rc=52: - Retry request on -CR_ENOSUPPORT Checkpoint failed: support missing from application
Jun 28 17:32:09 node8 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 52

It sounds as if I did not export the BLCR library in my LD_LIBRARY_PATH, but I did set up my user environment and I can run cr_run, cr_checkpoint and cr_restart on the command line.
I added $ENV{LD_LIBRARY_PATH} = "blcr_libpath" ; to checkpoint_script and the error still exists.
I use blcr-0.8.4 and torque-2.4.16.
Can someone help?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120704/9970d1f5/attachment.html

From simon.brennan at ersa.edu.au Wed Jul 4 19:52:54 2012
From: simon.brennan at ersa.edu.au (Simon Brennan)
Date: Thu, 05 Jul 2012 11:22:54 +0930
Subject: [torqueusers] nodes file persistent gpus setting
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C62010503312B3A@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C62010503312B3A@exvic-mbx04.nexus.csiro.au>
Message-ID: <4FF4F376.3040302@ersa.edu.au>

Sean (my colleague) and I have still been banging our heads against the wall with this issue. We've got torque 3.0.4 with gpu support enabled, and CUDA 4.0.

After some testing on my local desktop and a 17 node GPU cluster (a mixture of Tesla cards and GTX cards) we've found that if you have a nodes file with gpus= and attributes, and you change the state of a node (pbsnodes -o / pbsnodes -r) that has both gpus= and an attribute, then for some crazy unknown reason the nodes file is modified: all gpus= entries are removed, along with any # comments. Entries that only have a gpus= or an attribute aren't affected, only nodes that have both.

Why is there even code in Torque (specifically pbs_server) that is capable of writing to the nodes file!!

Some examples.... BLAH is just a random node attribute

Test1
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2 BLAH
node2 np=2

start torque server and mom.
#pbsnodes -r node2 (File doesn't change)
#pbsnodes -r node1 (File changes after command is run, stat on file confirms this)

nodes file contents
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=

Test2
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2 BLAH
node2 np=2 gpus=2

start torque server and mom.
#pbsnodes -r node2 (File doesn't change)
#pbsnodes -r node1 (File changes after command is run, stat on file confirms this)

nodes file contents
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=

Test3
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 BLAH
node2 np=2 gpus=2

start torque server and mom.
#pbsnodes -r node2 (File changes after command is run, stat on file confirms this)

nodes file contents
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=

Test4
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2
node2 np=2 gpus=2

start torque server and mom.
#pbsnodes -r node2 (File doesn't change)
#pbsnodes -r node1 (File doesn't change)

nodes file contents
node1 np=1 gpus=2
node2 np=2 gpus=2
=-=-=-=-=-=-=

Regards
Simon Brennan

-------- Original Message --------
Subject: Re: [torqueusers] nodes file persistent gpus setting
Date: Thu, 17 May 2012 15:50:09 +1000
From:
Reply-To: Torque Users Mailing List
To:

Hi Sean,

Woah -- we are *not* using the integrated nvidia gpu support (so far anyway). Perhaps that wasn't actually the problem on your system -- are you really sure that solved the problem and was not just a coincidence? We have nvidia drivers (on that compute node) but no other nvidia software on this system.

Gareth

*From:* Sean Reilly [mailto:sean.reilly at ersa.edu.au]
*Sent:* Thursday, 17 May 2012 12:21 PM
*To:* Torque Users Mailing List
*Subject:* Re: [torqueusers] nodes file persistent gpus setting

Hi Gareth

We saw the same behaviour when we enabled the tdk-1.285 libraries on the GPU backend Nodes in the ld.config path.
- It is needed on the CPU (non-gpu) Nodes
- But when added to the PATH on the GPU Nodes - the PBS_MOM complains about something missing (*Sorry I cant remember what it is - but it may have been some nvidia or nvc nvq type library*)
- Then the PBS_MOM rewrites the nodes file on the server side. *removing the gpus= or truncating the line from where 'gpus=' is written*

this was fixed by commenting out these libs on the GPU backend Node.
/etc/ld.so.conf.d/tdk.conf
#This file was made by puppet, do not edit it directly!
#/opt/shared/tdk/1.285/lib64
#/opt/shared/tdk/1.285/lib

Regards
Sean

On 17/05/12 05:56, Ken Nielson wrote:

On Sun, Apr 1, 2012 at 7:36 PM, wrote:

Hi,

Can anyone confirm the following behavior (bug)? If you give a node gpus like so:

qmgr -c 'set node gpunode01 gpus = 2'

or in the nodes file

gpunode01 np=12 gpus=2

Then the node has (logical) gpus defined and they can be scheduled as in:
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php
(though 1.5.3 doesn't mention specifying both np= and gpus= which I suspect needs fixing).

This setup works fine for us until we restart the pbs_server at which time the gpus disappear (you can see this in the output of pbsnodes). The nodes file gets altered to remove the gpus= setting. Note that we are using version 3.0.3-snap.xxx and NOT the integrated nvidia gpu support.

Does anyone else see the behavior? You don't need physical gpus to test, just a system you are prepared to mess with a little including restarting the pbs_server.

Regards,

Gareth

Gareth,

Have you entered a ticket in bugzilla for this?

Ken

--
*Sean Reilly*
Systems Administrator & Applications Support Officer
eResearchSA
Phone : +61 8 8313 8352
Mobile: +61 450 840 246

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120705/bb37de9c/attachment-0001.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 10004 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120705/bb37de9c/attachment-0001.png

From jascha.wang at gmail.com Wed Jul 4 20:55:32 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Thu, 5 Jul 2012 10:55:32 +0800
Subject: [torqueusers] restrict job number of each type
Message-ID:

Suppose I have two types of jobs, Type-A and Type-B: a Type-A job puts a heavy load on the CPU and a Type-B job puts a heavy load on memory. How can I set up maui to allow no more than one job of each type to run simultaneously on each node?

I thought it might be a maui question, so I also posted this request to the maui mailing list, but if someone here has already dealt with this kind of problem or has some clue about it, I would appreciate your advice.

Xiangqian

From ytt515 at yahoo.cn Wed Jul 4 01:04:56 2012
From: ytt515 at yahoo.cn (TingtingYang)
Date: Wed, 4 Jul 2012 15:04:56 +0800 (CST)
Subject: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
In-Reply-To: <3FC2336B-8D90-42CE-B4D8-DC487682C6F8@nyu.edu>
Message-ID: <1341385496.32801.YahooMailClassic@web92206.mail.cnh.yahoo.com>

I solved my problem by adding #PBS -v to my submit file, and I can now see some environment variables added in Variable_List. Thanks to all, and I hope this message will help others.

tingting.yang

--- On 2012-07-03, Sreedhar M wrote:

From: Sreedhar M
Subject: Re: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
To: "Torque Users Mailing List"
Date: 2012-07-03, 12:39

You need to install BLCR on all the compute nodes. I am assuming you're trying to run jobs on one of the compute nodes and that the node that has torque is not one of them.
Check the BLCR website for more information on adding the path of the blcr libraries to LD_LIBRARY_PATH.

Sreedhar Manchu
HPC Support Specialist
ITS-Esystems/Research Services
New York University, NY - 100012

On Jun 30, 2012, at 9:20 PM, TingtingYang wrote:

Thank you for your reply. I only installed blcr and torque on one node. Is that OK?

tingting.yang

--- On 2012-06-30, Sreedhar Manchu wrote:

From: Sreedhar Manchu
Subject: Re: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
To: "Torque Users Mailing List"
Date: 2012-06-30, 12:02

Just making sure:

You have installed BLCR on all the nodes. Right?

Sreedhar.

On Jun 28, 2012, at 6:09 AM, TingtingYang wrote:

Hi all:
I encountered an error when I tried to use torque/blcr to checkpoint my job.
I submit a job with qsub -c enabled ./crtest and hold it with qhold job_ID
The job error file said:
ERROR: ld.so: object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.

and /var/log/messages said:

Jun 28 17:32:09 node8 checkpoint_script: Invoked: /var/spool/torque/mom_priv/blcr_checkpoint_script 24946 55.node8 pbs pbs /var/spool/torque/checkpoint/55.node8.CK ckpt.55.node8.1340875929 15 -
Jun 28 17:32:09 node8 kernel: blcr: Retry request on -CR_ENOSUPPORT
Jun 28 17:32:09 node8 checkpoint_script: Subcommand (cr_checkpoint --signal 15 --tree 24946 --file ckpt.55.node8.1340875929) failed with rc=52: - Retry request on -CR_ENOSUPPORT Checkpoint failed: support missing from application
Jun 28 17:32:09 node8 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 52

It sounds as if I did not export the BLCR library in my LD_LIBRARY_PATH, but I did set up my user environment and I can run cr_run, cr_checkpoint and cr_restart on the command line.
I added $ENV{LD_LIBRARY_PATH} = "blcr_libpath" ; to checkpoint_script and the error still exists.
I use blcr-0.8.4 and torque-2.4.16.
Can someone help?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120704/e133782e/attachment.html

From delphine.ramalingom at univ-reunion.fr Thu Jul 5 05:34:20 2012
From: delphine.ramalingom at univ-reunion.fr (delphine.ramalingom at univ-reunion.fr)
Date: Thu, 5 Jul 2012 15:34:20 +0400 (RET)
Subject: [torqueusers] pbsdsh
Message-ID: <20985.80.12.213.34.1341488060.squirrel@webmail.univ-reunion.fr>

Hi,

I need some clarification about pbsdsh: when I use the pbsdsh command and print the value of $PBS_TASKNUM (my script is below), the numbering begins at 2. Is it always like this? I use torque 4.0.2.

#!/bin/bash
echo "Hello from $PBS_TASKNUM ...
Reading file.$PBS_TASKNUM"

Regards,
Delphine

From micafer1 at upv.es Thu Jul 5 01:25:42 2012
From: micafer1 at upv.es (Miguel Caballer)
Date: Thu, 05 Jul 2012 09:25:42 +0200
Subject: [torqueusers] Get queue resources_default.neednodes as a non-root user
Message-ID: <4FF54176.4040203@upv.es>

Hi,

We are working on a qsub wrapper. We want to get the list of nodes that are requested in a job submission. We can get the list of properties of a node with the pbs command. The problem is that we cannot get the "resources_default.neednodes" setting of the queues using qmgr -c "p s" as a non-root user. Is there any way for a non-root user to obtain this information?

Thanks.

From Rob.Holmes at bmtwbm.com.au Thu Jul 5 22:58:01 2012
From: Rob.Holmes at bmtwbm.com.au (Rob Holmes)
Date: Fri, 6 Jul 2012 04:58:01 +0000
Subject: [torqueusers] nodes file - basic install problems
Message-ID: <74C3EAAEAFC2E746A6BDC0F215CABBD70360D9@wbm-mail.bmt-wbm.local>

Hi,

I'm installing a small HPC cluster at work, which I've never done before and it's causing me problems.

My nodes file contains 14 compute nodes named node01, node02, etc. At the moment I just have four nodes switched on, with the remainder shown as 'down' with pbsnodes -a. When I submit a number of jobs, jobs are submitted to the first two nodes with the remaining two marked as 'free', regardless of how many jobs are waiting to be submitted. Jobs are kept in the queue until either of node01 or node02 come free, then are run. node03 and node12 (the other two live nodes) never run a job.

However, when I remove node01 for example (by commenting out node01 in the nodes file and restarting pbs_server), jobs will run on node12. Bizarrely, node03 is then marked as 'down' in pbsnodes -a.

This is long but basically I'm getting a lot of odd behavior and I'm not sure where to start debugging. All live nodes are running pbs_mom. The system was working as expected with just one compute node.
With more than one it is having problems. I'm running pbs_sched. Can anyone please help?

Cheers,

Rob

Rob Holmes
Environmental Scientist - Catchments and Receiving Environments

BMT WBM Pty Ltd
Level 8, 200 Creek Street
Brisbane QLD 4000 Australia
P: +61 7 3831 6744
F:
W: www.bmtwbm.com.au

E-mail confidentiality notice and disclaimer: The contents of this e-mail are intended for the use of the mail addressee(s) shown. If you are not that person, you are not allowed to read, action, copy, forward, distribute or disclose the contents and you should delete it from your system. BMT WBM accepts no liability for any errors or omissions in the content of this e-mail, nor does it accept liability for statements which are those of the author and clearly not made on behalf of the company.

Commercial Terms and Conditions: Unless otherwise agreed by BMT WBM in writing, all services or products supplied by BMT WBM shall be subject to and governed by BMT WBM's standard terms and conditions, which are available upon request.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120706/4fb2f98a/attachment-0001.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image4b8ac8.GIF
Type: image/gif
Size: 3074 bytes
Desc: image4b8ac8.GIF
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120706/4fb2f98a/attachment-0002.gif
-------------- next part --------------
A non-text attachment was scrubbed...
Name: imagee21530.GIF Type: image/gif Size: 3455 bytes Desc: imagee21530.GIF Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120706/4fb2f98a/attachment-0003.gif From gus at ldeo.columbia.edu Fri Jul 6 09:30:07 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 06 Jul 2012 11:30:07 -0400 Subject: [torqueusers] nodes file - basic install problems In-Reply-To: <74C3EAAEAFC2E746A6BDC0F215CABBD70360D9@wbm-mail.bmt-wbm.local> References: <74C3EAAEAFC2E746A6BDC0F215CABBD70360D9@wbm-mail.bmt-wbm.local> Message-ID: <4FF7047F.9050500@ldeo.columbia.edu> On 07/06/2012 12:58 AM, Rob Holmes wrote: > Hi, > > I'm installing a small HPC cluster at work, which I've never done before > and it's causing me problems. > > My nodes file contains 14 compute nodes named node01, node02, etc. At > the moment I just have four nodes switched on, with the remainder shown > as 'down' with pbsnodes -a. When I submit a number of jobs, jobs are > submitted to the first two nodes with the remaining two marked as > 'free', regardless of how many jobs are waiting to be submitted. Jobs > are kept in the queue until either of node01 or node02 come free, then > are run. node03 and node12 (the other two live nodes) never run a job. > > However, when I remove node01 for example (by commenting out node01 in > the nodes file and restarting pbs_server), jobs will run on node12. > Bizarrely, node03 is then marked as 'down' in pbsnodes -a. > > This is long but basically I'm getting a lot of odd behavior and I'm not > sure where to start debugging. All live nodes are running pbs_mom. The > system was working as expected with just one compute node. With more > than one it is having problems. I'm running pbs_sched. Can anyone please > help? > > Cheers, > > Rob > > *Rob Holmes*** > > *Environmental Scientist -
Catchments and Receiving Environments*** > > ** > > ** > > *BMT WBM Pty Ltd > *Level 8, 200 Creek Street > Brisbane QLD 4000 Australia > *P: *+61 7 3831 6744 > *F: * > *W: www.bmtwbm.com.au* > > Hi Rob Weird indeed. Maybe if you send more information, it will ring a bell. Here's a bunch of somewhat random possibilities. 1) It may help if you send the output of qmgr -c 'p s' of your nodes file, and of pbsnodes -a 2) Did you set 'np=XX' on the various lines of the nodes file? 3) Any chance that the queue[s] or the server is[are] configured with a maximum number of nodes or jobs? 4) Did you add any properties to the nodes, then perhaps used them to restrict job access to some nodes or queues? 5) Is pbs_mom running on all the four nodes that are up? 6) Any funny stuff in /var/log/messages on the server and perhaps on the compute nodes? 7) Likewise for $TORQUE/server_logs/YYYYMMDD [server], $TORQUE/sched_logs/YYYYMMDD [server], or $TORQUE/mom_logs/YYYYMMDD [compute nodes] ? 8) Any chance that the node names cannot be resolved? This is typically done in /etc/hosts on all nodes. Normally each node name is associated to a private [to the cluster] subnet. 9) Are the node's [Ethernet?] interfaces up and configured with the right IP addresses [as shown by ifconfig -a]? 10) Can you ping across every pair of nodes through the expected route [ping -R ...]? 11) Any firewalls perhaps blocking the access to the nodes? I hope this helps, Gus Correa
From pankaj.dorlikar at gmail.com Sun Jul 8 11:25:18 2012 From: pankaj.dorlikar at gmail.com (pankaj dorlikar) Date: Sun, 8 Jul 2012 22:55:18 +0530 Subject: [torqueusers] standing reservation gets deleted Message-ID: Hi, We have maui-3.2.6p21 and Torque Server Version 2.5.8 on rhel 5.2 x86_64 nodes. we want to have daily reservation for the user john from 9pm to next day 9 am (12 hrs). we have following configuration for the this : SRCFG[john1] DAYS=ALL SRCFG[john1] STARTTIME=21:00:00 ENDTIME=01:09:00:00 SRCFG[john1] USERLIST=john SRCFG[john1] HOSTLIST=node1,node2,node3,node4 if the maui stops and starts at 00:01 daily, it is observed that the current standing reservation -03:01:00 start and 08:59:00 end (john1.0.0) gets deleted and instead of that the reservation is seen as : 21:00:00 start and 01:08:59:00 end (john1.0.0) which will start at 9 pm today but the current going rservation which has started yesterday and was going on gets deleted. is the reservation needs to be created from 9 pm to 12:01 am and then 12:01 am to 9 pm ? or is there any other solution? thanks, -- Pankaj V. Dorlikar From Rob.Holmes at bmtwbm.com.au Sun Jul 8 16:23:48 2012 From: Rob.Holmes at bmtwbm.com.au (Rob Holmes) Date: Sun, 8 Jul 2012 22:23:48 +0000 Subject: [torqueusers] nodes file - basic install problems In-Reply-To: <4FF7047F.9050500@ldeo.columbia.edu> References: <74C3EAAEAFC2E746A6BDC0F215CABBD70360D9@wbm-mail.bmt-wbm.local> <4FF7047F.9050500@ldeo.columbia.edu> Message-ID: <74C3EAAEAFC2E746A6BDC0F215CABBD7036BBC@wbm-mail.bmt-wbm.local> Hi Gus, Thanks for your reply, especially as my post wasn't particularly informative! Checking out the server logs showed me the way.
There were two kinds of errors for node03: a warning of 'no route to host' and an error of 'connection to node03 is bad'. The problem was that node03 had its firewall up, and also that SElinux was switched on. I'm not sure why torque threw an error on some occasions but only a warning on others. As a double whammy, node12 had SElinux switched on too. I had switched all these off but forgot to make it permanent so they didn't survive a reboot. All is as it should be now and the system is working as expected. Thanks for your help, Rob BMT WBM Pty Ltd Level 8, 200 Creek Street Brisbane QLD 4000 Australia P: +61 7 3831 6744 F: W: www.bmtwbm.com.au E-mail confidentiality notice and disclaimer: The contents of this e-mail are intended for the use of the mail addressee(s) shown. If you are not that person, you are not allowed to read, action, copy, forward, distribute or disclose the contents and you should delete it from your system. BMT WBM accepts no liability for any errors or omissions in the content of this e-mail, nor does it accept liability for statements which are those of the author and clearly not made on behalf of the company. Commercial Terms and Conditions: Unless otherwise agreed by BMT WBM in writing, all services or products supplied by BMT WBM shall be subject to and governed by BMT WBM's standard terms and conditions, which are available upon request. -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gus Correa Sent: Saturday, 7 July 2012 01:30 AM To: Torque Users Mailing List Subject: Re: [torqueusers] nodes file - basic install problems On 07/06/2012 12:58 AM, Rob Holmes wrote: > Hi, > > I'm installing a small HPC cluster at work, which I've never done > before and it's causing me problems. > > My nodes file contains 14 compute nodes named node01, node02, etc. At > the moment I just have four nodes switched on, with the remainder > shown as 'down' with pbsnodes -a. 
When I submit a number of jobs, jobs > are submitted to the first two nodes with the remaining two marked as > 'free', regardless of how many jobs are waiting to be submitted. Jobs > are kept in the queue until either of node01 or node02 come free, then > are run. node03 and node12 (the other two live nodes) never run a job. > > However, when I remove node01 for example (by commenting out node01 in > the nodes file and restarting pbs_server), jobs will run on node12. > Bizarrely, node03 is then marked as 'down' in pbsnodes -a. > > This is long but basically I'm getting a lot of odd behavior and I'm > not sure where to start debugging. All live nodes are running pbs_mom. > The system was working as expected with just one compute node. With > more than one it is having problems. I'm running pbs_sched. Can anyone > please help? > > Cheers, > > Rob > > *Rob Holmes*** > > *Environmental Scientist - Catchments and Receiving Environments*** > > ** > > ** > > *BMT WBM Pty Ltd > *Level 8, 200 Creek Street > Brisbane QLD 4000 Australia > *P: *+61 7 3831 6744 > *F: * > *W: www.bmtwbm.com.au* > > Hi Rob Weird indeed. Maybe if you send more information, it will ring a bell. Here's a bunch of somewhat random possibilities. 1) It may help if you send the output of qmgr -c 'p s' of your nodes file, and of pbsnodes -a 2) Did you set 'np=XX' on the various lines of the nodes file? 3) Any chance that the queue[s] or the server is[are] configured with a maximum number of nodes or jobs? 4) Did you add any properties to the nodes, then perhaps used them to restrict job access to some nodes or queues? 5) Is pbs_mom running on all the four nodes that are up? 6) Any funny stuff in /var/log/messages on the server and perhaps on the compute nodes? 7) Likewise for $TORQUE/server_logs/YYYYMMDD [server], $TORQUE/sched_logs/YYYYMMDD [server], or $TORQUE/mom_logs/YYYYMMDD [compute nodes] ? 8) Any chance that the node names cannot be resolved? This is typically done in /etc/hosts on all nodes. 
Normally each node name is associated to a private [to the cluster] subnet. 9) Are the node's [Ethernet?] interfaces up and configured with the right IP addresses [as shown by ifconfig -a]? 10) Can you ping across every pair of nodes through the expected route [ping -R ...]? 11) Any firewalls perhaps blocking the access to the nodes? I hope this helps, Gus Correa _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From pankaj.dorlikar at gmail.com Mon Jul 9 10:03:22 2012 From: pankaj.dorlikar at gmail.com (pankaj dorlikar) Date: Mon, 9 Jul 2012 21:33:22 +0530 Subject: [torqueusers] Fwd: standing reservation gets deleted In-Reply-To: References: Message-ID: can some pointers be given on this thanks ---------- Forwarded message ---------- From: pankaj dorlikar Date: Sun, 8 Jul 2012 22:55:18 +0530 Subject: standing reservation gets deleted To: torqueusers Hi, We have maui-3.2.6p21 and Torque Server Version 2.5.8 on rhel 5.2 x86_64 nodes. we want to have daily reservation for the user john from 9pm to next day 9 am (12 hrs). we have following configuration for the this : SRCFG[john1] DAYS=ALL SRCFG[john1] STARTTIME=21:00:00 ENDTIME=01:09:00:00 SRCFG[john1] USERLIST=john SRCFG[john1] HOSTLIST=node1,node2,node3,node4 if the maui stops and starts at 00:01 daily, it is observed that the current standing reservation -03:01:00 start and 08:59:00 end (john1.0.0) gets deleted and instead of that the reservation is seen as : 21:00:00 start and 01:08:59:00 end (john1.0.0) which will start at 9 pm today but the current going rservation which has started yesterday and was going on gets deleted. is the reservation needs to be created from 9 pm to 12:01 am and then 12:01 am to 9 am ? or is there any other solution? thanks, -- Pankaj V. Dorlikar -- Pankaj V. 
Dorlikar From dbeer at adaptivecomputing.com Mon Jul 9 10:26:05 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 9 Jul 2012 10:26:05 -0600 Subject: [torqueusers] Fwd: standing reservation gets deleted In-Reply-To: References: Message-ID: Pankaj, Just to clarify, this is the torque users list. Someone may know the answer to your question, but this isn't really the forum for problems with Maui. David On Mon, Jul 9, 2012 at 10:03 AM, pankaj dorlikar wrote: > can some pointers be given on this > > thanks > > > ---------- Forwarded message ---------- > From: pankaj dorlikar > Date: Sun, 8 Jul 2012 22:55:18 +0530 > Subject: standing reservation gets deleted > To: torqueusers > > Hi, > > We have maui-3.2.6p21 and Torque Server Version 2.5.8 on rhel 5.2 x86_64 > nodes. > we want to have daily reservation for the user john from 9pm to next > day 9 am (12 hrs). we have following configuration for the this : > > SRCFG[john1] DAYS=ALL > SRCFG[john1] STARTTIME=21:00:00 ENDTIME=01:09:00:00 > SRCFG[john1] USERLIST=john > SRCFG[john1] HOSTLIST=node1,node2,node3,node4 > > if the maui stops and starts at 00:01 daily, it is observed that the > current standing reservation -03:01:00 start and 08:59:00 end > (john1.0.0) gets deleted and instead of that the reservation is seen > as : > 21:00:00 start and 01:08:59:00 end (john1.0.0) which will start at 9 > pm today but the current going rservation which has started yesterday > and was going on gets deleted. is the reservation needs to be created > from 9 pm to 12:01 am and then 12:01 am to 9 am ? or is there any > other solution? > > thanks, > -- > Pankaj V. Dorlikar > > > > -- > Pankaj V. Dorlikar > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120709/79f3185f/attachment-0001.html From pankaj.dorlikar at gmail.com Tue Jul 10 07:58:39 2012 From: pankaj.dorlikar at gmail.com (pankaj dorlikar) Date: Tue, 10 Jul 2012 19:28:39 +0530 Subject: [torqueusers] Fwd: standing reservation gets deleted In-Reply-To: References: Message-ID: thank you David. On 7/9/12, David Beer wrote: > Pankaj, > > Just to clarify, this is the torque users list. Someone may know the answer > to your question, but this isn't really the forum for problems with Maui. > > David > > On Mon, Jul 9, 2012 at 10:03 AM, pankaj dorlikar > wrote: > >> can some pointers be given on this >> >> thanks >> >> >> ---------- Forwarded message ---------- >> From: pankaj dorlikar >> Date: Sun, 8 Jul 2012 22:55:18 +0530 >> Subject: standing reservation gets deleted >> To: torqueusers >> >> Hi, >> >> We have maui-3.2.6p21 and Torque Server Version 2.5.8 on rhel 5.2 x86_64 >> nodes. >> we want to have daily reservation for the user john from 9pm to next >> day 9 am (12 hrs). we have following configuration for the this : >> >> SRCFG[john1] DAYS=ALL >> SRCFG[john1] STARTTIME=21:00:00 ENDTIME=01:09:00:00 >> SRCFG[john1] USERLIST=john >> SRCFG[john1] HOSTLIST=node1,node2,node3,node4 >> >> if the maui stops and starts at 00:01 daily, it is observed that the >> current standing reservation -03:01:00 start and 08:59:00 end >> (john1.0.0) gets deleted and instead of that the reservation is seen >> as : >> 21:00:00 start and 01:08:59:00 end (john1.0.0) which will start at 9 >> pm today but the current going rservation which has started yesterday >> and was going on gets deleted. is the reservation needs to be created >> from 9 pm to 12:01 am and then 12:01 am to 9 am ? or is there any >> other solution? >> >> thanks, >> -- >> Pankaj V. Dorlikar >> >> >> >> -- >> Pankaj V. 
Dorlikar >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > > > -- > David Beer | Software Engineer > Adaptive Computing > -- Pankaj V. Dorlikar From l.flis at cyf-kr.edu.pl Wed Jul 11 07:22:25 2012 From: l.flis at cyf-kr.edu.pl (Lukasz Flis) Date: Wed, 11 Jul 2012 15:22:25 +0200 Subject: [torqueusers] MUNGE authentication improvements [PATCH] Message-ID: <4FFD7E11.7080607@cyf-kr.edu.pl> Hi All, Thanks to the work of Mariusz Mamonski from Poznan Supercomputing and Networking Center, I'm able to share some improvements that were made to the MUNGE authentication mechanism in TORQUE. In the attachment you will find the patch torque-munge-api-support-v3.patch.gz; it works with versions 2.5.11 and 2.5.12 of Torque. The patch adds a new "--enable-munge-library" configure option, which turns on a new MUNGE authentication path based on the library API instead of external executables. The patch does not modify any of the old munge authentication code. We just add alternative methods, which are switched on by specifying the configure option. By using munge functions directly via the API we were able to get rid of expensive calls like popen (exec) and fsync used in the older method, and gain significant speedups in client request processing. As a result of the changes, pbs_server is a lot more responsive. The observed performance gain varies from 2x to more than 10x, depending on query type. The bigger the cluster, the bigger the performance gain you may expect. We have successfully tested the new implementation in our test environment. After verification on a smaller cluster, the patch has been in production since yesterday afternoon. This cluster processes around 25k jobs per day and no issues have been observed yet. -------------------------------------------------------------- We don't guarantee that it will work for you and take no responsibility for any damages it may cause.
--------------------------------------------------------------
Despite the above statement ;) it is worth trying. We did our best to ensure that it is cross-compatible with the old munge-auth and error free. The biggest benefits, however, can be seen on bigger clusters where the server is queried frequently (e.g. grid sites).

HOWTO install:

1. Get the torque sources: torque-2.5.12.tar.gz
2. Untar and apply the patch (decompress the attached .patch.gz first):
   $> tar -zxvf torque-2.5.12.tar.gz
   $> cd torque-2.5.12; patch -p1 < torque-munge-api-support-v3.patch
3. Regenerate the configure script by invoking:
   $> autoconf
   NOTE: m4 version 1.4.8 or newer is required. RHEL5 derivatives (like SL5) may require a newer package. For Scientific Linux 5 we used:
   $> wget ftp://ftp.scientificlinux.org/linux/scientific/5x/SRPMS/SL/m4-1.4.8-1.src.rpm
   $> rpmbuild --rebuild m4-1.4.8-1.src.rpm
   $> rpm -Uvh ../path/to/m4-1.4.8-1.x86_64.rpm
4. Read README.munge in the torque directory
5. Make sure you have munge-libs and munge-devel (library and headers)
   NOTE: munge-libs are LGPL since 0.5.9, earlier versions are GPL
6. Configure and build torque:
   $> ./configure --enable-munge-library && make

Good luck & happy testing. Any feedback is welcome. I wish you good performance gains :) -- Lukasz Flis ACC Cyfronet AGH Nawojki 11, 31-209 POLAND -------------- next part -------------- A non-text attachment was scrubbed... Name: torque-munge-api-support-v3.patch.gz Type: application/x-gzip Size: 5519 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120711/267e990c/attachment.gz
From dbeer at adaptivecomputing.com Wed Jul 11 17:53:29 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 11 Jul 2012 17:53:29 -0600 Subject: [torqueusers] --enable-unixsockets Message-ID: Does anyone use this option? I would love to see it removed from future versions of TORQUE, but we won't do this if people depend on it.
-- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120711/473c265d/attachment.html From raines at nmr.mgh.harvard.edu Thu Jul 12 08:39:41 2012 From: raines at nmr.mgh.harvard.edu (Paul Raines) Date: Thu, 12 Jul 2012 10:39:41 -0400 (EDT) Subject: [torqueusers] torque/maui assigning jobs to full nodes when other nodes are free Message-ID: I just did a total reinstall on our batch cluster upgrading all nodes to CentOS6 and updating to torque-2.5.11 and maui-3.3.1 I have over 100 nodes and only a few jobs submitted so far but somehow jobs are getting Deferred being assigned to nodes that have jobs already running on them even though pleny of empty free nodes exist. ========================================================== checking job 1710 State: Idle EState: Deferred Creds: user:award group:award class:p30 qos:DEFAULT WallTime: 00:00:00 of 4:00:00:00 SubmitTime: Thu Jul 12 09:38:18 (Time Queued Total: 00:50:31 Eligible: 00:00:00) StartDate: -00:50:30 Thu Jul 12 09:38:19 Total Tasks: 4 Req[0] TaskCount: 4 Partition: ALL Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 Opsys: [NONE] Arch: [NONE] Features: [nonGPU] IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 1 PartitionMask: [ALL] job is deferred. 
Reason: RMFailure (cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 MSG=cannot allocate node 'compute-0-6' to job - node not currently available (nps needed/free: 4/3, gpus needed/free: 0/0, joblist: 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)') Holds: Defer (hold reason: RMFailure) PE: 4.00 StartPriority: 103050 cannot select job 1710 for partition DEFAULT (job hold active) ========================================================== [root at launchpad ~]# pbsnodes -a compute-0-6 compute-0-6 state = job-exclusive np = 8 properties = nonGPU ntype = cluster jobs = 0/1021.launchpad.nmr.mgh.harvard.edu, 1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu, 3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu, 5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu, 7/1806.launchpad.nmr.mgh.harvard.edu status = rectime=1342103360,varattr=,jobs=1021.launchpad.nmr.mgh.harvard.edu 1754.launchpad.nmr.mgh.harvard.edu 1806.launchpad.nmr.mgh.harvard.edu 1816.launchpad.nmr.mgh.harvard.edu,state=free,netload=65919428331,gres=,loadave=5.39,ncpus=8,physmem=32877888kb,availmem=86083428kb,totmem=99986744kb,idletime=143787,nusers=4,nsessions=5,sessions=4122 9023 27009 26961 28966,uname=Linux compute-0-6 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 18:58:52 BST 2012 x86_64,opsys=linux gpus = 0 ========================================================== All these Deferred jobs are trying to run on compute-0-6 ==================================================== BLOCKED JOBS---------------- JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME 1710 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:18 1714 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:21 1715 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:22 1716 award Deferred 4 4:00:00:00 Thu Jul 12 
09:38:24 1717 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:25 1718 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:27 1726 tyler Deferred 1 4:00:00:00 Thu Jul 12 09:40:46 1761 lzollei Deferred 5 4:00:00:00 Thu Jul 12 09:57:36 1764 award Deferred 4 4:00:00:00 Thu Jul 12 09:58:54 1777 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:04:18 1779 tyler Deferred 1 4:00:00:00 Thu Jul 12 10:04:36 1784 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:07:39 1791 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:11:00 1803 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:17:43 1814 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:21:04 ==================================================== Some jobs we submit still get run on other nodes just fine. It seems random what is getting assigned to compute-0-6 and then deferred. There are lots of identical configured nodes free. I can force these jobs to run on other nodes with qrun by hand but what is going on? Here is my maui config which worked fine in my older setup ========================================================== RMPOLLINTERVAL 00:00:30 SERVERHOST launchpad.nmr.mgh.harvard.edu SERVERPORT 40559 SERVERMODE NORMAL ADMINHOST launchpad.nmr.mgh.harvard.edu RMCFG[base] TYPE=PBS ADMIN1 maui root ADMIN3 ALL LOGFILE /var/spool/maui/log/maui.log LOGFILEMAXSIZE 1000000000 LOGLEVEL 3 QUEUETIMEWEIGHT 1 CLASSWEIGHT 10 USERCFG[DEFAULT] MAXIPROC=8 CLASSCFG[default] MAXPROCPERUSER=150 CLASSCFG[matlab] MAXPROCPERUSER=60 CLASSCFG[max10] MAXPROCPERUSER=10 CLASSCFG[max20] MAXPROCPERUSER=20 CLASSCFG[max50] MAXPROCPERUSER=50 CLASSCFG[max75] MAXPROCPERUSER=75 CLASSCFG[max100] MAXPROCPERUSER=100 CLASSCFG[max200] MAXPROCPERUSER=200 CLASSCFG[p5] MAXPROCPERUSER=5000 CLASSCFG[p10] MAXPROCPERUSER=5000 CLASSCFG[p20] MAXPROCPERUSER=5000 CLASSCFG[p30] MAXPROCPERUSER=5000 CLASSCFG[p40] MAXPROCPERUSER=5000 CLASSCFG[p50] MAXPROCPERUSER=30 CLASSCFG[p60] MAXPROCPERUSER=20 CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250 CLASSCFG[GPU] MAXPROCPERUSER=5000 BACKFILLPOLICY FIRSTFIT 
RESERVATIONPOLICY CURRENTHIGHEST NODEALLOCATIONPOLICY PRIORITY NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT' ENFORCERESOURCELIMITS OFF ENABLEMULTIREQJOBS TRUE ==================================================== There is nothing in the queue configs that would favor any nodes over any other. --------------------------------------------------------------- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (2301) 13th Street Charlestown, MA 02129 USA The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail. From raines at nmr.mgh.harvard.edu Thu Jul 12 08:45:16 2012 From: raines at nmr.mgh.harvard.edu (Paul Raines) Date: Thu, 12 Jul 2012 10:45:16 -0400 (EDT) Subject: [torqueusers] torque/maui assigning jobs to full nodes when other nodes are free In-Reply-To: References: Message-ID: As a followup, after running qrun on a job to get it to run on another node, maui still seems confused thinking it is still allocated to compute-0-6 as this output shows: [root at launchpad ~]# checkjob 1713 checking job 1713 State: Running Creds: user:award group:award class:p30 qos:DEFAULT WallTime: 00:06:16 of 4:00:00:00 SubmitTime: Thu Jul 12 09:38:19 (Time Queued Total: 00:57:50 Eligible: 00:00:00) StartTime: Thu Jul 12 10:36:09 StartDate: -1:03:59 Thu Jul 12 09:38:20 Total Tasks: 4 Req[0] TaskCount: 5 Partition: DEFAULT Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 Opsys: [NONE] Arch: [NONE] Features: [nonGPU] NodeCount: 2 Allocated Nodes: [compute-0-6:4][compute-0-16:1] IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 1 PartitionMask: [ALL] 
Reservation '1713' (-00:06:10 -> 3:23:53:50 Duration: 4:00:00:00) Messages: cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 MSG=cannot allocate node 'compute-0-6' to job - node not currently available (nps needed/free: 4/3, gpus needed/free: 0/0, joblist: 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)' PE: 5.00 StartPriority: 103003 [root at launchpad ~]# qstat -n 1713 launchpad.nmr.mgh.harvard.edu: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 1713.launchpad.n award p30 pbsjob_1420 10808 1 4 -- 96:00 R 00:05 compute-0-16/3+compute-0-16/2+compute-0-16/1+compute-0-16/0 -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Thu, 12 Jul 2012 10:39am, Paul Raines wrote: > > I just did a total reinstall on our batch cluster upgrading all nodes > to CentOS6 and updating to torque-2.5.11 and maui-3.3.1 > > I have over 100 nodes and only a few jobs submitted so far but > somehow jobs are getting Deferred being assigned to nodes that > have jobs already running on them even though pleny of empty > free nodes exist. > > ========================================================== > checking job 1710 > > State: Idle EState: Deferred > Creds: user:award group:award class:p30 qos:DEFAULT > WallTime: 00:00:00 of 4:00:00:00 > SubmitTime: Thu Jul 12 09:38:18 > (Time Queued Total: 00:50:31 Eligible: 00:00:00) > > StartDate: -00:50:30 Thu Jul 12 09:38:19 > Total Tasks: 4 > > Req[0] TaskCount: 4 Partition: ALL > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 > Opsys: [NONE] Arch: [NONE] Features: [nonGPU] > > > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 1 > PartitionMask: [ALL] > job is deferred. 
Reason: RMFailure (cannot start job - RM failure, rc: > 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 MSG=cannot > allocate node 'compute-0-6' to job - node not currently available (nps > needed/free: 4/3, gpus needed/free: 0/0, joblist: > 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)') > Holds: Defer (hold reason: RMFailure) > PE: 4.00 StartPriority: 103050 > cannot select job 1710 for partition DEFAULT (job hold active) > ========================================================== > > [root at launchpad ~]# pbsnodes -a compute-0-6 > compute-0-6 > state = job-exclusive > np = 8 > properties = nonGPU > ntype = cluster > jobs = 0/1021.launchpad.nmr.mgh.harvard.edu, > 1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu, > 3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu, > 5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu, > 7/1806.launchpad.nmr.mgh.harvard.edu > status = > rectime=1342103360,varattr=,jobs=1021.launchpad.nmr.mgh.harvard.edu > 1754.launchpad.nmr.mgh.harvard.edu 1806.launchpad.nmr.mgh.harvard.edu > 1816.launchpad.nmr.mgh.harvard.edu,state=free,netload=65919428331,gres=,loadave=5.39,ncpus=8,physmem=32877888kb,availmem=86083428kb,totmem=99986744kb,idletime=143787,nusers=4,nsessions=5,sessions=4122 > 9023 27009 26961 28966,uname=Linux compute-0-6 2.6.32-220.23.1.el6.x86_64 #1 > SMP Mon Jun 18 18:58:52 BST 2012 x86_64,opsys=linux > gpus = 0 > > ========================================================== > > All these Deferred jobs are trying to run on compute-0-6 > > ==================================================== > BLOCKED JOBS---------------- > JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME > > 1710 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:18 > 1714 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:21 > 1715 award Deferred 
4 4:00:00:00 Thu Jul 12 09:38:22 > 1716 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:24 > 1717 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:25 > 1718 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:27 > 1726 tyler Deferred 1 4:00:00:00 Thu Jul 12 09:40:46 > 1761 lzollei Deferred 5 4:00:00:00 Thu Jul 12 09:57:36 > 1764 award Deferred 4 4:00:00:00 Thu Jul 12 09:58:54 > 1777 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:04:18 > 1779 tyler Deferred 1 4:00:00:00 Thu Jul 12 10:04:36 > 1784 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:07:39 > 1791 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:11:00 > 1803 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:17:43 > 1814 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:21:04 > ==================================================== > > Some jobs we submit still get run on other nodes just fine. It seems > random what is getting assigned to compute-0-6 and then deferred. > > There are lots of identical configured nodes free. I can force these > jobs to run on other nodes with qrun by hand but what is going on? 
> > Here is my maui config which worked fine in my older setup > ========================================================== > RMPOLLINTERVAL 00:00:30 > SERVERHOST launchpad.nmr.mgh.harvard.edu > SERVERPORT 40559 > SERVERMODE NORMAL > ADMINHOST launchpad.nmr.mgh.harvard.edu > RMCFG[base] TYPE=PBS > ADMIN1 maui root > ADMIN3 ALL > LOGFILE /var/spool/maui/log/maui.log > LOGFILEMAXSIZE 1000000000 > LOGLEVEL 3 > QUEUETIMEWEIGHT 1 > CLASSWEIGHT 10 > USERCFG[DEFAULT] MAXIPROC=8 > CLASSCFG[default] MAXPROCPERUSER=150 > CLASSCFG[matlab] MAXPROCPERUSER=60 > CLASSCFG[max10] MAXPROCPERUSER=10 > CLASSCFG[max20] MAXPROCPERUSER=20 > CLASSCFG[max50] MAXPROCPERUSER=50 > CLASSCFG[max75] MAXPROCPERUSER=75 > CLASSCFG[max100] MAXPROCPERUSER=100 > CLASSCFG[max200] MAXPROCPERUSER=200 > CLASSCFG[p5] MAXPROCPERUSER=5000 > CLASSCFG[p10] MAXPROCPERUSER=5000 > CLASSCFG[p20] MAXPROCPERUSER=5000 > CLASSCFG[p30] MAXPROCPERUSER=5000 > CLASSCFG[p40] MAXPROCPERUSER=5000 > CLASSCFG[p50] MAXPROCPERUSER=30 > CLASSCFG[p60] MAXPROCPERUSER=20 > CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250 > CLASSCFG[GPU] MAXPROCPERUSER=5000 > BACKFILLPOLICY FIRSTFIT > RESERVATIONPOLICY CURRENTHIGHEST > NODEALLOCATIONPOLICY PRIORITY > NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT' > ENFORCERESOURCELIMITS OFF > ENABLEMULTIREQJOBS TRUE > ==================================================== > > There is nothing in the queue configs that would favor any nodes over > any other. > > --------------------------------------------------------------- > Paul Raines http://help.nmr.mgh.harvard.edu > MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging > 149 (2301) 13th Street Charlestown, MA 02129 USA > > > > The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . 
If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.

From raines at nmr.mgh.harvard.edu Thu Jul 12 08:48:47 2012
From: raines at nmr.mgh.harvard.edu (Paul Raines)
Date: Thu, 12 Jul 2012 10:48:47 -0400 (EDT)
Subject: [torqueusers] torque/maui assigning jobs to full nodes when other nodes are free
In-Reply-To:
References:
Message-ID:

And another followup: newly submitted jobs are now getting deferred
because they are assigned to the nodes that the jobs I 'qrun'ed were run on.

===========================================================
[root@launchpad ~]# checkjob 1850

checking job 1850

State: Idle EState: Deferred
Creds: user:lzollei group:lzollei class:default qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Thu Jul 12 10:37:49
(Time Queued Total: 00:08:40 Eligible: 00:00:01)

StartDate: -00:08:38 Thu Jul 12 10:37:51
Total Tasks: 5

Req[0] TaskCount: 5 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [nonGPU]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-16 MSG=cannot allocate node 'compute-0-16' to job - node not currently available (nps needed/free: 5/4, gpus needed/free: 0/0, joblist: 1713.launchpad.nmr.mgh.harvard.edu:0,1713.launchpad.nmr.mgh.harvard.edu:1,1713.launchpad.nmr.mgh.harvard.edu:2,1713.launchpad.nmr.mgh.harvard.edu:3)')
Holds: Defer (hold reason: RMFailure)
PE: 5.00 StartPriority: 101000
cannot select job 1850 for partition DEFAULT (job hold active)
===================================================

It is as if maui is not getting the memo about where jobs are actually
running, and therefore which nodes are free.
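The rejection message above carries the two numbers that matter: processors needed versus processors free on the rejected host. As a minimal sketch (the function name nps_needed_free is hypothetical, and the pattern is only matched against the message text shown in this thread, not against any documented format), they could be pulled out of saved checkjob output like this:

```shell
# Hypothetical helper, not part of TORQUE or Maui.
# Reads text on stdin and prints "<needed> <free>" for every line that
# contains an "nps needed/free: N/M" fragment.
nps_needed_free() {
    sed -n 's/.*nps needed\/free: \([0-9][0-9]*\)\/\([0-9][0-9]*\).*/\1 \2/p'
}

# Sample fragment copied from the checkjob 1850 output in this message.
msg="node not currently available (nps needed/free: 5/4, gpus needed/free: 0/0)"

printf '%s\n' "$msg" | nps_needed_free | while read -r needed free; do
    echo "needed=$needed free=$free short_by=$((needed - free))"
done
# prints: needed=5 free=4 short_by=1
```

Here the job wants 5 processors on compute-0-16 while only 4 are free, which matches the four slots that job 1713 holds in the joblist.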
-- Paul Raines (http://help.nmr.mgh.harvard.edu)

On Thu, 12 Jul 2012 10:45am, Paul Raines wrote:

> As a followup, after running qrun on a job to get it to run on another node,
> maui still seems confused thinking it is still allocated to compute-0-6 as
> this output shows:
>
> [root@launchpad ~]# checkjob 1713
>
> checking job 1713
>
> State: Running
> Creds: user:award group:award class:p30 qos:DEFAULT
> WallTime: 00:06:16 of 4:00:00:00
> SubmitTime: Thu Jul 12 09:38:19
> (Time Queued Total: 00:57:50 Eligible: 00:00:00)
>
> StartTime: Thu Jul 12 10:36:09
> StartDate: -1:03:59 Thu Jul 12 09:38:20
> Total Tasks: 4
>
> Req[0] TaskCount: 5 Partition: DEFAULT
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [nonGPU]
> NodeCount: 2
> Allocated Nodes:
> [compute-0-6:4][compute-0-16:1]
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Reservation '1713' (-00:06:10 -> 3:23:53:50 Duration: 4:00:00:00)
> Messages: cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 MSG=cannot allocate node 'compute-0-6' to job - node not currently available (nps needed/free: 4/3, gpus needed/free: 0/0, joblist: 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)'
> PE: 5.00 StartPriority: 103003
>
> [root@launchpad ~]# qstat -n 1713
>
> launchpad.nmr.mgh.harvard.edu:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 1713.launchpad.n     award    p30      pbsjob_1420       10808     1   4    --  96:00 R 00:05
>    compute-0-16/3+compute-0-16/2+compute-0-16/1+compute-0-16/0
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)

From nt_mahmood at yahoo.com Thu Jul 12 13:22:12 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Thu, 12 Jul 2012 12:22:12 -0700 (PDT)
Subject: [torqueusers] qsub doesn't find the executable file
Message-ID: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com>

Dear all,
I am trying to qsub an executable file but it doesn't find the file!

-rwxrwxrwx 1 u1 users 76277 Jul 12 23:08 msim
-rw-r--r-- 1 u1 users    89 Jul 12 23:45 tor

The tor file contains

#PBS -N test
#PBS -V
#PBS -q orcaq
#PBS -l nodes=1
#PBS -j oe
cd $PBS_O_WORKDIR
msim

However the output file says:
/var/spool/pbs/mom_priv/jobs/61247.hpclab.orca.SC: line 8: msim: command not found

The binary file is not corrupted, since running ./msim shows the usage.

Any idea about that?

// Naderan *Mahmood;

From sm4082 at nyu.edu Thu Jul 12 13:28:13 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 12 Jul 2012 15:28:13 -0400
Subject: [torqueusers] qsub doesn't find the executable file
In-Reply-To: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com>
References: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Message-ID:

Hi Naderan,

Why don't you do the same in the job script, I mean use ./msim instead of msim in the tor file? I think that, since it is not a system command, you need to tell the shell to look in the working directory, which you have already cd'ed to, with ./executable. I don't know whether this is the reason, but I always use ./executable and I don't have any problems.

Sreedhar.

From chris.evert at geokinetics.com Thu Jul 12 14:03:11 2012
From: chris.evert at geokinetics.com (Chris Evert)
Date: Thu, 12 Jul 2012 15:03:11 -0500
Subject: [torqueusers] qsub doesn't find the executable file
In-Reply-To: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com>
References: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Message-ID: <4FFF2D7F.40201@geokinetics.com>

Naderan Mahmood,

It looks like . is not in your path. If msim is in $PBS_O_WORKDIR, then
./msim should work (like from the command line.)

Another option is to provide the full path of msim in the file tor.

Hope this helps,
Chris
--
Chris Evert
Geokinetics, Inc.
Houston, TX

From imoverclocked at gmail.com Wed Jul 11 22:36:15 2012
From: imoverclocked at gmail.com (Tim Spriggs)
Date: Wed, 11 Jul 2012 21:36:15 -0700
Subject: [torqueusers] pbsmake -- a way to specify jobs with dependencies
Message-ID:

Hi All,

I've been working with torque/maui and my user base for a while and
I've found the ability to submit jobs with dependencies is pretty
useful. However, it's also pretty messy if you have any kind of
less-than-trivial dependency graph and you are trying to capture job
IDs via a shell. Given some other push/pull factors I decided to
develop a utility that takes a make-like syntax where each "make"
target gets turned into a job and submitted to the job scheduler. Here
is a quick example:

---
> cat test
#!/usr/bin/pbsmake -f

123: testing
    @Output_Path localhost:~/job_outputs/
    echo 123
    date

testing:
    @Output_Path localhost:~/job_outputs/
    echo testing
    date
    sleep 2

> ./test
testing(41672.torqueserver...) scheduled
123(41673.torqueserver...) scheduled
> sleep 4
> cat ~/job_outputs/testing.o41672 ~/job_outputs/123.o41673
testing
Wed Jul 11 20:58:57 MST 2012
123
Wed Jul 11 20:59:00 MST 2012
---

Of course, other parameters can be specified per-target such as queue,
resource limits, mail settings, etc.

pbsmake is written against the python pbs module and uses a single
connection to send all jobs to the scheduler which makes it fairly
efficient for sending many jobs at once.

pbsmake supports wildcards with wildcard dependencies.
This allows all kinds of mischievous stuff to be written (exercise best
left to the user) and a quick example of a wildcard target:

---
#!/usr/bin/pbsmake -f

a-%: %
    @Output_Path localhost:~/job_outputs/
    echo a ${pm_target_match}

b-%: %
    @Output_Path localhost:~/job_outputs/
    echo b ${pm_target_match}

c-%: %
    @Output_Path localhost:~/job_outputs/
    echo c ${pm_target_match}

d:
    @Output_Path localhost:~/job_outputs/
    echo d

> ./test-wild a-b-c-d
d(41682.torqueserver...) scheduled
c-d(41683.torqueserver...) scheduled
b-c-d(41684.torqueserver...) scheduled
a-b-c-d(41685.torqueserver...) scheduled
> cat ~/job_outputs/*.o4168[2345]
a b-c-d
b c-d
c d
d
---

Finally, to make debugging targets with dependencies and wildcards
much easier, there is a graphviz output mode that will output a
digraph that can be piped to dotty (or other graphviz tools) to view
the dependencies before they are actually sent out. Yes, it will
visualize circular dependencies appropriately:

---
> ./test-wild -d a-b-c-d
digraph pbsmakefile {
t_0 -> t_1;
t_2 -> t_0;
t_3 -> t_2;
t_2 [label="b-c-d"];
t_3 [label="a-b-c-d"];
t_0 [label="c-d"];
t_1 [label="d"];
}
---

The above command can be invoked as "./test-wild -d a-b-c-d | dotty -"
to pop up an appropriate graph.

For those of you who are still reading, congratulations! You win the
URL to the released versions of pbsmake as well as the development
repository for it:

http://pirlwww.lpl.arizona.edu/~tims/pbsmake/
https://github.com/imoverclocked/pbsmake

I'd like to take a second to thank Chris Van Horne for all of the
development work he has done to get pbsmake where it is now while also
putting up with my sometimes much less than rock solid plans for how
pbsmake should behave in certain corner cases and turning them into
specs and code.

Hopefully others may find this useful in their work with torque/pbs.
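For comparison, the manual approach this message calls messy (capturing job IDs in a shell) might look roughly like the sketch below for a simple linear chain. It relies on qsub printing the new job ID on standard output and on TORQUE's "-W depend=afterok:<jobid>" option; the function name submit_chain and the QSUB override variable are hypothetical, added only so the logic can be exercised without a scheduler. This is not how pbsmake itself works.

```shell
# Hypothetical helper: submit job scripts in order, each one depending on
# the successful completion of the previous one.
# QSUB may be set to a stand-in command for dry runs; it defaults to qsub.
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -n "$prev" ]; then
            # Later jobs wait for the previous job to exit successfully.
            prev=$("${QSUB:-qsub}" -W "depend=afterok:$prev" "$script") || return 1
        else
            # The first job in the chain has no dependency.
            prev=$("${QSUB:-qsub}" "$script") || return 1
        fi
        echo "$script -> $prev"
    done
}

# Usage (script names are placeholders):
#   submit_chain testing.sh 123.sh
```

Anything beyond a straight chain needs one captured ID per target plus bookkeeping for fan-in and fan-out, which is the part a make-like description takes over.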
Cheers,
-Tim

From bdandrus at nps.edu Thu Jul 12 14:35:24 2012
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Thu, 12 Jul 2012 20:35:24 +0000
Subject: [torqueusers] openmpi failing when run under torque
Message-ID:

All,

I am upgrading our cluster to CentOS 6 and am having great grief running a simple MPI program.
It runs fine under a direct login to the node; however, if I try running it under an interactive session, it segfaults and core dumps.

Running when directly ssh-ing to the node:
============================================
[bdandrus@compute-3-1 OPENMPI]$ mpirun -np 4 ./a.out
Process 0 starting to receive
1: compute-3-1 says it is process 1 of 4.
received
2: compute-3-1 says it is process 2 of 4.
received
3: compute-3-1 says it is process 3 of 4.
Received
=============================================

Running under a 'qsub -I':
============================================
[bdandrus@compute-3-1 OPENMPI]$ mpirun -np 4 ./a.out
[compute-3-1:31659] *** Process received signal ***
[compute-3-1:31659] Signal: Segmentation fault (11)
[compute-3-1:31659] Signal code: Address not mapped (1)
[compute-3-1:31659] Failing at address: 0x40
[compute-3-1:31659] [ 0] /lib64/libpthread.so.0(+0xf500) [0x7ffff6b71500]
[compute-3-1:31659] [ 1] /lib64/libc.so.6(_IO_vfprintf+0x3679) [0x7ffff6817389]
[compute-3-1:31659] [ 2] /lib64/libc.so.6(vasprintf+0xba) [0x7ffff683e8da]
[compute-3-1:31659] [ 3] /opt/openmpi/1.6/lib/libmpi.so.1(opal_show_help_vstring+0x333) [0x7ffff7b51dc3]
[compute-3-1:31659] [ 4] /opt/openmpi/1.6/lib/libmpi.so.1(orte_show_help+0xac) [0x7ffff7ae19fc]
[compute-3-1:31659] [ 5] /opt/openmpi/1.6/lib/openmpi/mca_btl_openib.so(+0xb0ba) [0x7ffff30d60ba]
[compute-3-1:31659] [ 6] /opt/openmpi/1.6/lib/openmpi/mca_mpool_rdma.so(+0x15ff) [0x7ffff49715ff]
[compute-3-1:31659] [ 7] /opt/openmpi/1.6/lib/openmpi/mca_mpool_rdma.so(mca_mpool_rdma_alloc+0xa9) [0x7ffff49720c9]
[compute-3-1:31659] [ 8] /opt/openmpi/1.6/lib/libmpi.so.1(ompi_free_list_grow+0x280) [0x7ffff7a81980]
[compute-3-1:31659] [ 9] /opt/openmpi/1.6/lib/openmpi/mca_btl_openib.so(+0xc34a) [0x7ffff30d734a]
[compute-3-1:31659] [10] /opt/openmpi/1.6/lib/openmpi/mca_btl_openib.so(+0xfb6e) [0x7ffff30dab6e]
[compute-3-1:31659] [11] /opt/openmpi/1.6/lib/libmpi.so.1(mca_btl_base_select+0x114) [0x7ffff7ac3764]
[compute-3-1:31659] [12] /opt/openmpi/1.6/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12) [0x7ffff39199d2]
[compute-3-1:31659] [13] /opt/openmpi/1.6/lib/libmpi.so.1(mca_bml_base_init+0x99) [0x7ffff7ac2f49]
[compute-3-1:31659] [14] /opt/openmpi/1.6/lib/openmpi/mca_pml_ob1.so(+0x4ea0) [0x7ffff3d23ea0]
[compute-3-1:31659] [15] /opt/openmpi/1.6/lib/libmpi.so.1(mca_pml_base_select+0x1e4) [0x7ffff7ad2404]
[compute-3-1:31659] [16] /opt/openmpi/1.6/lib/libmpi.so.1(ompi_mpi_init+0x3ca) [0x7ffff7a96dba]
[compute-3-1:31659] [17] /opt/openmpi/1.6/lib/libmpi.so.1(MPI_Init+0x170) [0x7ffff7aacf00]
[compute-3-1:31659] [18] ./a.out(main+0x4f) [0x400ba3]
[compute-3-1:31659] [19] /lib64/libc.so.6(__libc_start_main+0xfd) [0x7ffff67eecdd]
[compute-3-1:31659] [20] ./a.out() [0x400a99]
[compute-3-1:31659] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31659 on node compute-3-1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
===========================================================

This is under torque 3.0.5 using the torque scheduler as well (for testing).

Any ideas what may be going on here?
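Since the same binary on the same node works from an ssh login and fails inside 'qsub -I', one generic first step is to compare what the two shells actually look like. This is only a suggestion, not a diagnosis established anywhere in this thread. The sketch below records resource limits and the environment to a file so the two sessions can be diffed; the function name snapshot_session and the file names are hypothetical.

```shell
# Hypothetical helper: dump the current shell's resource limits and
# environment into the file given as the first argument.
snapshot_session() {
    out=$1
    {
        echo "== ulimit -a =="
        ulimit -a
        echo "== env =="
        env | sort
    } > "$out"
}

# Usage (paths are placeholders):
#   in the ssh login:   snapshot_session /tmp/session.ssh
#   inside 'qsub -I':   snapshot_session /tmp/session.pbs
#   afterwards:         diff /tmp/session.ssh /tmp/session.pbs
```

Differences in limits or in variables such as PATH and LD_LIBRARY_PATH are the kind of thing such a diff would surface; whether any of them explains the segfault is left open.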
Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From nt_mahmood at yahoo.com Fri Jul 13 00:31:58 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Thu, 12 Jul 2012 23:31:58 -0700 (PDT)
Subject: [torqueusers] qsub doesn't find the executable file
In-Reply-To: <4FFF2D7F.40201@geokinetics.com>
References: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com> <4FFF2D7F.40201@geokinetics.com>
Message-ID: <1342161118.87344.YahooMailNeo@web111717.mail.gq1.yahoo.com>

Chris and Sreedhar,
none of the proposed methods worked.

/var/spool/pbs/mom_priv/jobs/61251.hpclab.orca.SC: line 8: ./msim: No such file or directory

and

/var/spool/pbs/mom_priv/jobs/61252.hpclab.orca.SC: line 8: /home/u1/workspace/msim: No such file or directory

Any more ideas?
Actually I have not faced this problem myself. One of our users encountered it and I have run out of ideas.

// Naderan *Mahmood;

From nt_mahmood at yahoo.com Fri Jul 13 00:41:23 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Thu, 12 Jul 2012 23:41:23 -0700 (PDT)
Subject: [torqueusers] qsub doesn't find the executable file
In-Reply-To: <1342161118.87344.YahooMailNeo@web111717.mail.gq1.yahoo.com>
References: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com> <4FFF2D7F.40201@geokinetics.com> <1342161118.87344.YahooMailNeo@web111717.mail.gq1.yahoo.com>
Message-ID: <1342161683.74629.YahooMailNeo@web111719.mail.gq1.yahoo.com>

I think I found the problem but don't know how to fix it.
/home is shared via NFS, so I ssh'ed to a node (other than the server).

u1@orca:~/workspace$ ssh ws05
Last login: Sat Jun 4 14:22:08 2011 from hpclab.orca
u1@ws05:~$ ls
workspace

u1@ws05:~$ cd workspace/
u1@ws05:~/workspace$ ls -l msim
-rwxr-xr-x 1 u1 users 81179 Jul 12 2012 msim

u1@ws05:~/workspace$ ./msim
-bash: ./msim: No such file or directory

It seems that on the other nodes it cannot execute!

// Naderan *Mahmood;

From L.S.Lowe at bham.ac.uk Fri Jul 13 02:36:45 2012
From: L.S.Lowe at bham.ac.uk (Lawrence Lowe)
Date: Fri, 13 Jul 2012 09:36:45 +0100 (BST)
Subject: [torqueusers] qsub doesn't find the executable file
In-Reply-To: <1342161683.74629.YahooMailNeo@web111719.mail.gq1.yahoo.com>
References: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com> <4FFF2D7F.40201@geokinetics.com> <1342161118.87344.YahooMailNeo@web111717.mail.gq1.yahoo.com> <1342161683.74629.YahooMailNeo@web111719.mail.gq1.yahoo.com>
Message-ID:

Hi, maybe the NFS filesystem is mounted with the "noexec" option.

LSL.

From philippe.Weill at latmos.ipsl.fr Fri Jul 13 02:39:34 2012
From: philippe.Weill at latmos.ipsl.fr (Philippe Weill)
Date: Fri, 13 Jul 2012 10:39:34 +0200
Subject: [torqueusers] qsub doesn't find the executable file
In-Reply-To: <1342161683.74629.YahooMailNeo@web111719.mail.gq1.yahoo.com>
References: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com> <4FFF2D7F.40201@geokinetics.com> <1342161118.87344.YahooMailNeo@web111717.mail.gq1.yahoo.com> <1342161683.74629.YahooMailNeo@web111719.mail.gq1.yahoo.com>
Message-ID: <4FFFDEC6.5030502@latmos.ipsl.fr>

On 13/07/2012 08:41, Mahmood Naderan wrote:
> I think I found the problem but don't know how to fix that.
> /home is shared via NFS. So I ssh to a node (other than server).
>
> u1@orca:~/workspace$ ssh ws05
> Last login: Sat Jun 4 14:22:08 2011 from hpclab.orca
> u1@ws05:~$ ls
> workspace
>
> u1@ws05:~$ cd workspace/
> u1@ws05:~/workspace$ ls -l msim
> -rwxr-xr-x 1 u1 users 81179 Jul 12 2012 msim
>
> u1@ws05:~/workspace$ ./msim
> -bash: ./msim: No such file or directory
>
>
> It seems that on other nodes, it can not execute!
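When a shell reports "No such file or directory" for a file that ls can see, the complaint can be about something the file needs (for an ELF binary, its loader) rather than about the file itself, whereas a noexec mount normally shows up as "Permission denied". A hedged sketch of generic checks follows; diagnose_exec is a hypothetical name, and the function only calls the standard file, ldd and df tools, skipping file or ldd if they are not installed.

```shell
# Hypothetical helper: print basic facts about an executable that refuses
# to run. Intended to be run on both the server and a compute node.
diagnose_exec() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "$f: does not exist on $(uname -n)"
        return 1
    fi
    if [ ! -x "$f" ]; then
        echo "$f: no execute permission for $(id -un)"
    fi
    # file(1) shows whether this is a 32-bit or 64-bit ELF binary or a script.
    if command -v file >/dev/null 2>&1; then
        file "$f"
    fi
    # ldd lists the shared libraries the binary needs; it can fail outright
    # when the matching loader is missing, so its exit status is ignored.
    if command -v ldd >/dev/null 2>&1; then
        ldd "$f" 2>&1 || true
    fi
    # Show the filesystem holding the file, so its mount options can be checked.
    df -P "$f" | tail -n 1
}

# Usage: diagnose_exec ./msim
```

Comparing the output from the server with the output from a node such as ws05 narrows down whether the difference lies in the file, the filesystem, or the libraries installed on the host.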
> > could you do a file ./msim -- Weill Philippe - Administrateur Systeme et Reseaux CNRS/UPMC/IPSL LATMOS (UMR 8190) Tour 45/46 3e Etage B302 - 4 Place Jussieu - 75252 Paris Cedex 05 - FRANCE Email:philippe.weill at latmos.ipsl.fr | tel:+33 0144274759 Fax:+33 0144273776 From nt_mahmood at yahoo.com Fri Jul 13 03:31:37 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Fri, 13 Jul 2012 02:31:37 -0700 (PDT) Subject: [torqueusers] qsub doesn't find the executable file In-Reply-To: <4FFFDEC6.5030502@latmos.ipsl.fr> References: <1342120932.63969.YahooMailNeo@web111712.mail.gq1.yahoo.com> <4FFF2D7F.40201@geokinetics.com> <1342161118.87344.YahooMailNeo@web111717.mail.gq1.yahoo.com> <1342161683.74629.YahooMailNeo@web111719.mail.gq1.yahoo.com> <4FFFDEC6.5030502@latmos.ipsl.fr> Message-ID: <1342171897.79889.YahooMailNeo@web111722.mail.gq1.yahoo.com> ok, file command points the right direction. msim: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for NU/Linux 2.6.15, BuildID[sha1]=0xcf66490c90387ce270d14bf4f4e3a01294935a36, not stripped So the problem is that I want to run a 32bit binary on a amd64 os. Server has the package "ia32-libs" however I didn't install that on the nodes. Thanks a lot for your help. ? // Naderan *Mahmood; ----- Original Message ----- From: Philippe Weill To: Mahmood Naderan ; Torque Users Mailing List Cc: Sent: Friday, July 13, 2012 1:09 PM Subject: Re: [torqueusers] qsub doesn't find the executable file Le 13/07/2012 08:41, Mahmood Naderan a ?crit : > I think I found the problem but don't know how to fix that. > /home is shared via NFS. So I ssh to a node (other than server). > > u1 at orca:~/workspace$ ssh ws05 > Last login: Sat Jun? 4 14:22:08 2011 from hpclab.orca > u1 at ws05:~$ ls > workspace > > u1 at ws05:~$ cd workspace/ > u1 at ws05:~/workspace$ ls -l msim > -rwxr-xr-x 1 u1 users 81179 Jul 12? 
2012 msim
>
> u1 at ws05:~/workspace$ ./msim
> -bash: ./msim: No such file or directory
>
>
> It seems that on other nodes, it can not execute!
>
>

could you do a

file ./msim

--
Weill Philippe - Administrateur Systeme et Reseaux
CNRS/UPMC/IPSL LATMOS (UMR 8190)
Tour 45/46 3e Etage B302 - 4 Place Jussieu - 75252 Paris Cedex 05 - FRANCE
Email:philippe.weill at latmos.ipsl.fr | tel:+33 0144274759 Fax:+33 0144273776

From jinbianbian at huawei.com Thu Jul 12 20:15:56 2012
From: jinbianbian at huawei.com (Jinbianbian)
Date: Fri, 13 Jul 2012 02:15:56 +0000
Subject: [torqueusers] torque qorder command do not work
Message-ID:

Hi all,

I am using TORQUE 3.0.2 and Maui 3.3.1. I found that the "qorder" command does not work. I also found reports of this problem on the web: other people have run into it, but I did not find any solution.

What I want to know:
1. Can the problem be solved on my cluster, and how?
2. Which versions of TORQUE and Maui work correctly with the "qorder" command?

Thank you
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120713/8d8b510b/attachment.html

From bdandrus at nps.edu Fri Jul 13 12:31:22 2012
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Fri, 13 Jul 2012 18:31:22 +0000
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC1006@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com> <20120404170947.GC30037@blackice.msi.umn.edu> <20120404175102.GD30037@blackice.msi.umn.edu> <560DBE57F33C4C4C9FBF11C662951AF805AC1006@ORSMSX106.amr.corp.intel.com>
Message-ID:

Hmm.

I am trying to just build torque 4.0.1 using cpusets and I keep getting:
===================================================
checking for HWLOC... configure: error: cpuset support requires the hwloc package



Perhaps you should add the directory containing 'hwloc.pc'
to the PKG_CONFIG_PATH environment variable.

Alternatively, you may set the environment variables
HWLOC_CFLAGS and HWLOC_LIBS before running configure.

Example:
export HWLOC_CFLAGS='-I/usr/local/hwloc-1.1/include'
export HWLOC_LIBS='-L/usr/local/hwloc-1.1/lib -lhwloc'

error: Bad exit status from /var/tmp/rpm-tmp.iowncz (%build)
===================================================
I have the stock hwloc packages installed:
[root at hamming SPECS]# rpm -qa |grep hwloc
hwloc-devel-1.1-0.1.el6.i686
hwloc-1.1-0.1.el6.x86_64
hwloc-devel-1.1-0.1.el6.x86_64
hwloc-1.1-0.1.el6.i686

I have tried setting HWLOC_CFLAGS and HWLOC_LIBS to no avail.

Is there something special to set when using the stock packages for CentOS 6.3?


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238


-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A
Sent: Wednesday, April 04, 2012 12:07 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] TORQUE 4.0 and hwloc

Yeah, when I passed the "-with cpuset" to the rpmbuild command then I got the following in my stdout from the actual rpm build process (from the configure part actually):

checking whether to allow geometry requests... no
checking whether to support NUMA systems... no
checking for HWLOC... yes
checking for hwloc_bitmap_alloc in -lhwloc... yes

and then during the actual compile I see a -lhwloc being linked in.
--
Steven DuChene

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gabe Turner
Sent: Wednesday, April 04, 2012 10:51 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] TORQUE 4.0 and hwloc

On Wed, Apr 04, 2012 at 12:09:47PM -0500, Gabe Turner wrote:
> It looks to me like the spec file is supporting the --with option to
> rpmbuild. So cpuset will be enabled as a configure option if you pass
> '--with cpuset' to rpmbuild. Is that what you are already trying?

I just did this to build the RPMs with support for cpusets. Admittedly, it is a bit cumbersome, though perhaps only because I have hwloc installed in a centralized location and not from an RPM.

gabe at node1084 [~/torque-4.0.1] % make rpm HWLOC_CFLAGS='-I/soft/hwloc/1.4.1/include' HWLOC_LIBS='-L/soft/hwloc/1.4.1/lib -lhwloc' RPM_AC_OPTS+='--with cpuset'

gabe at node1084 [~/torque-4.0.1] % rpm -qRp ~/rpmbuild/RPMS/x86_64/torque-client-4.0.1-1.cri.x86_64.rpm
torque = 4.0.1-1.cri
.
.
.
libhwloc.so.5()(64bit)
.
.
.


--
Gabe Turner gabe at msi.umn.edu
HPC Systems Administrator,
University of Minnesota
Supercomputing Institute http://www.msi.umn.edu

From pankaj.dorlikar at gmail.com Fri Jul 13 12:33:38 2012
From: pankaj.dorlikar at gmail.com (pankaj dorlikar)
Date: Sat, 14 Jul 2012 00:03:38 +0530
Subject: [torqueusers] issue with reservation
Message-ID:

Hi,

We have maui-3.2.6p21 and Torque Server Version 2.5.8 on RHEL 5.2 x86_64 nodes.

We want a daily reservation for the user john from 9 pm to 9 am the next day (12 hours).
We have the following configuration for this:

SRCFG[john1] DAYS=ALL
SRCFG[john1] STARTTIME=21:00:00 ENDTIME=01:09:00:00
SRCFG[john1] USERLIST=john
SRCFG[john1] HOSTLIST=node1,node2,node3,node4

If Maui stops and starts at 00:01 daily, we observe that the current standing reservation

-03:01:00 start and 08:59:00 end (john1.0.0)

gets deleted, and the reservation is instead shown as:

21:00:00 start and 01:08:59:00 end (john1.0.0)

This one will start at 9 pm today, but the reservation that started yesterday and was still in progress gets deleted.

Does the reservation need to be created in two parts, from 9 pm to 00:01 am and then from 00:01 am to 9 am? Or is there any other solution?

thanks,

--
Pankaj V. Dorlikar

From dbeer at adaptivecomputing.com Fri Jul 13 13:28:09 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 13 Jul 2012 13:28:09 -0600
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com> <20120404170947.GC30037@blackice.msi.umn.edu> <20120404175102.GD30037@blackice.msi.umn.edu> <560DBE57F33C4C4C9FBF11C662951AF805AC1006@ORSMSX106.amr.corp.intel.com>
Message-ID:

Brian,

This is a bug in TORQUE 4.0.1 that is fixed in 4.0.3. You just need to add

./configure --with-path=

For right now the only workaround is to install it to the default location or to grab the latest 4.0-fixes and install that using the configure mentioned above.

David

On Fri, Jul 13, 2012 at 12:31 PM, Andrus, Brian Contractor wrote:

> Hmm.
>
> I am trying to just build torque 4.0.1 using cpusets and I keep getting:
> ===================================================
> checking for HWLOC...
configure: error: cpuset support requires the hwloc > package > > > > Perhaps you should add the directory containing 'hwloc.pc' > to the PKG_CONFIG_PATH environment variable. > > Alternatively, you may set the environment variables > HWLOC_CFLAGS and HWLOC_LIBS before running configure. > > Example: > export HWLOC_CFLAGS='-I/usr/local/hwloc-1.1/include' > export HWLOC_LIBS='-L/usr/local/hwloc-1.1/lib -lhwloc' > > error: Bad exit status from /var/tmp/rpm-tmp.iowncz (%build) > =================================================== > I have the stock hwloc packages installed: > [root at hamming SPECS]# rpm -qa |grep hwloc > hwloc-devel-1.1-0.1.el6.i686 > hwloc-1.1-0.1.el6.x86_64 > hwloc-devel-1.1-0.1.el6.x86_64 > hwloc-1.1-0.1.el6.i686 > > I have tried setting HWLOC_CFLAGS and HWLOC_LIBS to no avail. > > Is there something special to set when using the stock packages for CentOS > 6.3? > > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > > > > > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A > Sent: Wednesday, April 04, 2012 12:07 PM > To: Torque Users Mailing List > Subject: Re: [torqueusers] TORQUE 4.0 and hwloc > > Yeah, when I passed the "-with cpuset" to the rpmbuild command then I got > the following in my stdout from the actual rpm build process (from the > configure part actually): > > checking whether to allow geometry requests... no checking whether to > support NUMA systems... no checking for HWLOC... yes checking for > hwloc_bitmap_alloc in -lhwloc... yes > > and then during the actual compile I see a -lhwloc being linked in. 
> -- > Steven DuChene > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of Gabe Turner > Sent: Wednesday, April 04, 2012 10:51 AM > To: torqueusers at supercluster.org > Subject: Re: [torqueusers] TORQUE 4.0 and hwloc > > On Wed, Apr 04, 2012 at 12:09:47PM -0500, Gabe Turner wrote: > > It looks to me like the spec file is supporting the --with option to > > rpmbuild. So cpuset will be enabled as a configure option if you pass > > '--with cpuset' to rpmbuild. Is that what you are already trying? > > I just did this to build the RPMs with support for cpusets. Admittedly, it > is a bit cumbersome, though perhaps only because I have hwloc installed in > a centralized location and not from an RPM. > > gabe at node1084 [~/torque-4.0.1] % make rpm > HWLOC_CFLAGS='-I/soft/hwloc/1.4.1/include' > HWLOC_LIBS='-L/soft/hwloc/1.4.1/lib -lhwloc' RPM_AC_OPTS+='--with cpuset' > > gabe at node1084 [~/torque-4.0.1] % rpm -qRp > ~/rpmbuild/RPMS/x86_64/torque-client-4.0.1-1.cri.x86_64.rpm > torque = 4.0.1-1.cri > . > . > . > libhwloc.so.5()(64bit) > . > . > . > > > -- > Gabe Turner gabe at msi.umn.edu > HPC Systems Administrator, > University of Minnesota > Supercomputing Institute http://www.msi.umn.edu > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120713/806874b8/attachment.html

From Gareth.Williams at csiro.au Sun Jul 15 19:17:51 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Mon, 16 Jul 2012 11:17:51 +1000
Subject: [torqueusers] torque qorder command do not work
In-Reply-To:
References:
Message-ID: <007DECE986B47F4EABF823C1FBB19C62010529679F84@exvic-mbx04.nexus.csiro.au>

Hi Jinbianbian,

I suspect no-one will want to answer this so I'll give it a go.

Maui (or another scheduler) would usually make decisions about how to order jobs based on a range of policies. In that context, the order that jobs were submitted does not necessarily matter. As such the function of 'qorder' is meaningless. It would only be useful if you used a scheduler that had a policy to run jobs in a fifo order unless qorder was invoked - but I don't think such a scheduler exists.

I can think of a couple of actions that might help you.
1) use the setspri maui command to set a 'system' priority on jobs to set/force a particular order.
2) enable user priority in your scheduler policy and have users submit jobs with qsub -p (or maybe alter jobs with qalter -p)

Both of these actions will need a raised level of privilege which may or may not be acceptable in your site.

Gareth

From: Jinbianbian [mailto:jinbianbian at huawei.com]
Sent: Friday, 13 July 2012 12:16 PM
To: torqueusers at supercluster.org
Cc: Zhongjianfeng
Subject: [torqueusers] torque qorder command do not work

Hi,all
I am using torque3.0.2 and maui3.3.1 . I found the command "qorder" did not work. I also found the problem On the network.there were some people got the problems ,but I did not find any solution.
What I want to know:
1. Does the problem can be solved in my cluster? and how?
2.
Which version of torque and maui can work correctly with the command "qorder"

Thank you

From taras.shapovalov at brightcomputing.com Mon Jul 16 06:08:19 2012
From: taras.shapovalov at brightcomputing.com (Taras Shapovalov)
Date: Mon, 16 Jul 2012 14:08:19 +0200
Subject: [torqueusers] pbs_sched listening interface
Message-ID: <50040433.6040104@brightcomputing.com>

Hi all,

Is it possible to change the interface that pbs_sched listens on? It looks like pbs_sched does not use TRQ_IFNAME from torque.cfg.

--
Taras

From jinbianbian at huawei.com Tue Jul 17 02:09:42 2012
From: jinbianbian at huawei.com (Jinbianbian)
Date: Tue, 17 Jul 2012 08:09:42 +0000
Subject: [torqueusers] Re: torqueusers Digest, Vol 96, Issue 8
In-Reply-To:
References:
Message-ID:

Hi all,

Please give me some suggestions about this problem:

4. torque qorder command do not work (Jinbianbian)

The detailed description of this problem is:

I am using TORQUE 3.0.2 and Maui 3.3.1. I found that the "qorder" command does not work. I also found reports of this problem on the web: other people have run into it, but I did not find any solution.

What I want to know:
1. Can the problem be solved on my cluster, and how?
2. Which versions of TORQUE and Maui work correctly with the "qorder" command?

Thank you
Mini

From jinbianbian at huawei.com Tue Jul 17 02:29:20 2012
From: jinbianbian at huawei.com (Jinbianbian)
Date: Tue, 17 Jul 2012 08:29:20 +0000
Subject: [torqueusers] torque qorder command do not work (Gareth.Williams@csiro.au)
In-Reply-To:
References:
Message-ID:

Hi Gareth,

Thank you for your reply. I think you are right: the 'qorder' command cannot work the way we want. It is very kind of you to suggest some actions. I am now considering the first one, which should also solve my problem.

Thank you very much. Have a nice day!

jinbianbian

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of torqueusers-request at supercluster.org
Sent: 17 July 2012
16:10
To: torqueusers at supercluster.org
Subject: torqueusers Digest, Vol 96, Issue 9

Send torqueusers mailing list submissions to
torqueusers at supercluster.org

To subscribe or unsubscribe via the World Wide Web, visit
http://www.supercluster.org/mailman/listinfo/torqueusers
or, via email, send a message with subject or body 'help' to
torqueusers-request at supercluster.org

You can reach the person managing the list at
torqueusers-owner at supercluster.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of torqueusers digest..."


Today's Topics:

1. issue with reservation (pankaj dorlikar)
2. Re: TORQUE 4.0 and hwloc (David Beer)
3. Re: torque qorder command do not work (Gareth.Williams at csiro.au)
4. pbs_sched listening interface (Taras Shapovalov)
5. Re: torqueusers Digest, Vol 96, Issue 8 (Jinbianbian)


----------------------------------------------------------------------

Message: 1
Date: Sat, 14 Jul 2012 00:03:38 +0530
From: pankaj dorlikar
Subject: [torqueusers] issue with reservation
To: mauiusers , torqueusers
Message-ID:
Content-Type: text/plain; charset=ISO-8859-1

Hi,

We have maui-3.2.6p21 and Torque Server Version 2.5.8 on rhel 5.2 x86_64 nodes.

we want to have daily reservation for the user john from 9pm to next day 9 am (12 hrs).

we have following configuration for the this :

SRCFG[john1] DAYS=ALL
SRCFG[john1] STARTTIME=21:00:00 ENDTIME=01:09:00:00
SRCFG[john1] USERLIST=john
SRCFG[john1] HOSTLIST=node1,node2,node3,node4

if the maui stops and starts at 00:01 daily, it is observed that the current standing reservation

-03:01:00 start and 08:59:00 end (john1.0.0)

gets deleted and instead of that the reservation is seen as :

21:00:00 start and 01:08:59:00 end (john1.0.0)

which will start at 9 pm today but the current going reservation which has started yesterday and was going on gets deleted.

is the reservation needs to be created from 9 pm to 00:01 am and then 00:01 am to 9 am ? or is there any other solution?
thanks, -- Pankaj V. Dorlikar ------------------------------ Message: 2 Date: Fri, 13 Jul 2012 13:28:09 -0600 From: David Beer Subject: Re: [torqueusers] TORQUE 4.0 and hwloc To: Torque Users Mailing List Message-ID: Content-Type: text/plain; charset="iso-8859-1" Brian, This is a bug in TORQUE 4.0.1 that is fixed in 4.0.3. You just need to add ./configure --with-path= For right now the only workaround is to install it to the default location or to grab the latest 4.0-fixes and install that using the configure mentioned above. David On Fri, Jul 13, 2012 at 12:31 PM, Andrus, Brian Contractor wrote: > Hmm. > > I am trying to just build torque 4.0.1 using cpusets and I keep getting: > =================================================== > checking for HWLOC... configure: error: cpuset support requires the hwloc > package > > > > Perhaps you should add the directory containing 'hwloc.pc' > to the PKG_CONFIG_PATH environment variable. > > Alternatively, you may set the environment variables > HWLOC_CFLAGS and HWLOC_LIBS before running configure. > > Example: > export HWLOC_CFLAGS='-I/usr/local/hwloc-1.1/include' > export HWLOC_LIBS='-L/usr/local/hwloc-1.1/lib -lhwloc' > > error: Bad exit status from /var/tmp/rpm-tmp.iowncz (%build) > =================================================== > I have the stock hwloc packages installed: > [root at hamming SPECS]# rpm -qa |grep hwloc > hwloc-devel-1.1-0.1.el6.i686 > hwloc-1.1-0.1.el6.x86_64 > hwloc-devel-1.1-0.1.el6.x86_64 > hwloc-1.1-0.1.el6.i686 > > I have tried setting HWLOC_CFLAGS and HWLOC_LIBS to no avail. > > Is there something special to set when using the stock packages for CentOS > 6.3? 
> > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > > > > > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A > Sent: Wednesday, April 04, 2012 12:07 PM > To: Torque Users Mailing List > Subject: Re: [torqueusers] TORQUE 4.0 and hwloc > > Yeah, when I passed the "-with cpuset" to the rpmbuild command then I got > the following in my stdout from the actual rpm build process (from the > configure part actually): > > checking whether to allow geometry requests... no checking whether to > support NUMA systems... no checking for HWLOC... yes checking for > hwloc_bitmap_alloc in -lhwloc... yes > > and then during the actual compile I see a -lhwloc being linked in. > -- > Steven DuChene > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of Gabe Turner > Sent: Wednesday, April 04, 2012 10:51 AM > To: torqueusers at supercluster.org > Subject: Re: [torqueusers] TORQUE 4.0 and hwloc > > On Wed, Apr 04, 2012 at 12:09:47PM -0500, Gabe Turner wrote: > > It looks to me like the spec file is supporting the --with option to > > rpmbuild. So cpuset will be enabled as a configure option if you pass > > '--with cpuset' to rpmbuild. Is that what you are already trying? > > I just did this to build the RPMs with support for cpusets. Admittedly, it > is a bit cumbersome, though perhaps only because I have hwloc installed in > a centralized location and not from an RPM. > > gabe at node1084 [~/torque-4.0.1] % make rpm > HWLOC_CFLAGS='-I/soft/hwloc/1.4.1/include' > HWLOC_LIBS='-L/soft/hwloc/1.4.1/lib -lhwloc' RPM_AC_OPTS+='--with cpuset' > > gabe at node1084 [~/torque-4.0.1] % rpm -qRp > ~/rpmbuild/RPMS/x86_64/torque-client-4.0.1-1.cri.x86_64.rpm > torque = 4.0.1-1.cri > . > . > . > libhwloc.so.5()(64bit) > . > . > . 
> > > -- > Gabe Turner gabe at msi.umn.edu > HPC Systems Administrator, > University of Minnesota > Supercomputing Institute http://www.msi.umn.edu > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120713/806874b8/attachment-0001.html ------------------------------ Message: 3 Date: Mon, 16 Jul 2012 11:17:51 +1000 From: Subject: Re: [torqueusers] torque qorder command do not work To: Message-ID: <007DECE986B47F4EABF823C1FBB19C62010529679F84 at exvic-mbx04.nexus.csiro.au> Content-Type: text/plain; charset="us-ascii" Hi Jinbianbian, I suspect no-one will want to answer this so I'll give it a go. Maui (or another scheduler) would usually make decisions about how to order jobs based on a range of policies. In that context, the order that jobs were submitted does not necessarily matter. As such the function of 'qorder' is meaningless. It would only be useful if you used a scheduler that had a policy to run jobs in a fifo order unless qorder was invoked - but I don't think such a scheduler exists. I can think of a couple of actions that might help you. 1) use the setspri maui command to set a 'system' priority on jobs to set/force a particular order.
2) enable user priority in your scheduler policy and have users submit jobs with qsub -p (or maybe alter jobs with qalter -p) Both of these actions will need a raised level of privilege which may or may not be acceptable in your site. Gareth From: Jinbianbian [mailto:jinbianbian at huawei.com] Sent: Friday, 13 July 2012 12:16 PM To: torqueusers at supercluster.org Cc: Zhongjianfeng Subject: [torqueusers] torque qorder command do not work Hi,all I am using torque3.0.2 and maui3.3.1 . I found the command "qorder" did not work. I also found the problem On the network.there were some people got the problems ,but I did not find any solution. What I want to know: 1. Does the problem can be solved in my cluster? and how? 2. Which version of torque and maui can work correctly with the command "qorder" Thank you ------------------------------ Message: 4 Date: Mon, 16 Jul 2012 14:08:19 +0200 From: Taras Shapovalov Subject: [torqueusers] pbs_sched listening interface To: torqueusers at supercluster.org Message-ID: <50040433.6040104 at brightcomputing.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi all, Is it possible to change interface, which pbs_sched is listening to? It looks like pbs_sched does not use TRQ_IFNAME from torque.cfg. -- Taras ------------------------------ Message: 5 Date: Tue, 17 Jul 2012 08:09:42 +0000 From: Jinbianbian Subject: [torqueusers] ??: torqueusers Digest, Vol 96, Issue 8 To: "torqueusers at supercluster.org" Cc: Zhongjianfeng Message-ID: Content-Type: text/plain; charset="gb2312" Hi all, Please give me some suggestions about this problem: 4. torque qorder command do not work (Jinbianbian) The detailed description of this problem is : I am using torque3.0.2 and maui3.3.1 . I found the command "qorder" did not work. I also found the problem On the network.there were some people got the problems ,but I did not find any solution. What I want to know: 1. Does the problem can be solved in my cluster? and how? 2. 
Which version of torque and maui can work correctly with the command "qorder" Thank you Mini -----????----- ???: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] ?? torqueusers-request at supercluster.org ????: 2012?7?14? 2:31 ???: torqueusers at supercluster.org ??: torqueusers Digest, Vol 96, Issue 8 Send torqueusers mailing list submissions to torqueusers at supercluster.org To subscribe or unsubscribe via the World Wide Web, visit http://www.supercluster.org/mailman/listinfo/torqueusers or, via email, send a message with subject or body 'help' to torqueusers-request at supercluster.org You can reach the person managing the list at torqueusers-owner at supercluster.org When replying, please edit your Subject line so it is more specific than "Re: Contents of torqueusers digest..." Today's Topics: 1. Re: qsub doesn't find the executable file (Lawrence Lowe) 2. Re: qsub doesn't find the executable file (Philippe Weill) 3. Re: qsub doesn't find the executable file (Mahmood Naderan) 4. torque qorder command do not work (Jinbianbian) 5. Re: TORQUE 4.0 and hwloc (Andrus, Brian Contractor) ---------------------------------------------------------------------- Message: 1 Date: Fri, 13 Jul 2012 09:36:45 +0100 (BST) From: Lawrence Lowe Subject: Re: [torqueusers] qsub doesn't find the executable file To: Mahmood Naderan , Torque Users Mailing List Message-ID: Content-Type: text/plain; charset="iso-8859-15" Hi, maybe the NFS filesystem is mounted with the "noexec" option. LSL. On Thu, 12 Jul 2012, Mahmood Naderan wrote: > I think I found the problem but don't know how to fix that. > /home is shared via NFS. So I ssh to a node (other than server). > > u1 at orca:~/workspace$ ssh ws05 > Last login: Sat Jun? 4 14:22:08 2011 from hpclab.orca > u1 at ws05:~$ ls > workspace > > u1 at ws05:~$ cd workspace/ > u1 at ws05:~/workspace$ ls -l msim > -rwxr-xr-x 1 u1 users 81179 Jul 12? 
2012 msim > > u1 at ws05:~/workspace$ ./msim > -bash: ./msim: No such file or directory > > > It seems that on other nodes, it can not execute! > > > // Naderan *Mahmood; > > > ----- Original Message ----- > From: Mahmood Naderan > To: Torque Users Mailing List > Cc: > Sent: Friday, July 13, 2012 11:01 AM > Subject: Re: [torqueusers] qsub doesn't find the executable file > > Chris and Sreedha, > none of the proposed methods worked. > > /var/spool/pbs/mom_priv/jobs/61251.hpclab.orca.SC: line 8: ./msim: No such file or directory > > and > > /var/spool/pbs/mom_priv/jobs/61252.hpclab.orca.SC: line 8: /home/u1/workspace/msim: No such file or directory > > any more idea? > Actually I myself have not faced such problem. One of our users encountered that and now I have no more idea about that. > > ? > // Naderan *Mahmood; > > > ----- Original Message ----- > From: Chris Evert > To: torqueusers at supercluster.org > Cc: > Sent: Friday, July 13, 2012 12:33 AM > Subject: Re: [torqueusers] qsub doesn't find the executable file > > Naderan Mahmood, > > It looks like . is not in your path.? If msim is in $PBS_O_WORKDIR, then > ./msim should work (like from the command line.) > > Another option is to provide the full path of msim in the file tor. > > Hope this helps, > Chris > -- > Chris Evert > Geokinetics, Inc. > Houston, TX > > On 07/12/2012 02:22 PM, Mahmood Naderan wrote: >> Dear all, >> I am trying to qsub an executable file but it doesn't find the file! >> >> -rwxrwxrwx? 1 u1 users 76277 Jul 12 23:08 msim >> -rw-r--r--? 1 u1 users? ? 89 Jul 12 23:45 tor >> >> >> tor file contains >> >> #PBS -N test >> #PBS -V >> #PBS -q orcaq >> #PBS -l nodes=1 >> #PBS -j oe >> cd $PBS_O_WORKDIR >> msim >> >> >> However the output file says: >> /var/spool/pbs/mom_priv/jobs/61247.hpclab.orca.SC: line 8: msim: command not found >> >> >> binary file is not corrupted, since when run ./msim, it shows the usage. >> >> any idea for that? 
>> >> // Naderan *Mahmood; >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > ------------------------------ Message: 2 Date: Fri, 13 Jul 2012 10:39:34 +0200 From: Philippe Weill Subject: Re: [torqueusers] qsub doesn't find the executable file To: Mahmood Naderan , Torque Users Mailing List Message-ID: <4FFFDEC6.5030502 at latmos.ipsl.fr> Content-Type: text/plain; charset=UTF-8; format=flowed Le 13/07/2012 08:41, Mahmood Naderan a ?crit : > I think I found the problem but don't know how to fix that. > /home is shared via NFS. So I ssh to a node (other than server). > > u1 at orca:~/workspace$ ssh ws05 > Last login: Sat Jun 4 14:22:08 2011 from hpclab.orca > u1 at ws05:~$ ls > workspace > > u1 at ws05:~$ cd workspace/ > u1 at ws05:~/workspace$ ls -l msim > -rwxr-xr-x 1 u1 users 81179 Jul 12 2012 msim > > u1 at ws05:~/workspace$ ./msim > -bash: ./msim: No such file or directory > > > It seems that on other nodes, it can not execute! 
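The symptom quoted above, where `ls -l` shows an executable file but the shell reports "No such file or directory", is commonly what Linux prints when the program interpreter (dynamic loader) named inside the binary is missing on that machine, for instance when a 32-bit binary runs on a 64-bit node without 32-bit libraries. A minimal sketch, assuming Python is available on the node, that reads the ELF header directly and reports the word size and the loader path; `inspect_elf` is a hypothetical helper name, not a Torque or system tool.

```python
import os
import struct
import sys

PT_INTERP = 3  # program header type that holds the dynamic loader path


def inspect_elf(data):
    """Return (bits, interpreter_path_or_None) for an ELF image given as bytes."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}[data[4]]      # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    order = {1: "<", 2: ">"}[data[5]]   # EI_DATA: 1 = little, 2 = big endian
    if bits == 32:
        phoff, = struct.unpack_from(order + "I", data, 28)
        phentsize, phnum = struct.unpack_from(order + "HH", data, 42)
    else:
        phoff, = struct.unpack_from(order + "Q", data, 32)
        phentsize, phnum = struct.unpack_from(order + "HH", data, 54)
    for index in range(phnum):
        base = phoff + index * phentsize
        p_type, = struct.unpack_from(order + "I", data, base)
        if p_type != PT_INTERP:
            continue
        if bits == 32:
            offset, = struct.unpack_from(order + "I", data, base + 4)
            size, = struct.unpack_from(order + "I", data, base + 16)
        else:
            offset, = struct.unpack_from(order + "Q", data, base + 8)
            size, = struct.unpack_from(order + "Q", data, base + 32)
        return bits, data[offset:offset + size].rstrip(b"\0").decode()
    return bits, None  # no PT_INTERP entry: statically linked


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as handle:
        word_size, loader = inspect_elf(handle.read())
    print("%d-bit ELF, loader: %s" % (word_size, loader))
    if loader is not None:
        print("loader present on this host:", os.path.exists(loader))
```

Run on a compute node against the job's executable, it shows whether the loader path stored in the binary (for 32-bit x86 binaries this is usually /lib/ld-linux.so.2) exists on that node.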
> > could you do a file ./msim -- Weill Philippe - Administrateur Systeme et Reseaux CNRS/UPMC/IPSL LATMOS (UMR 8190) Tour 45/46 3e Etage B302 - 4 Place Jussieu - 75252 Paris Cedex 05 - FRANCE Email:philippe.weill at latmos.ipsl.fr | tel:+33 0144274759 Fax:+33 0144273776 ------------------------------ Message: 3 Date: Fri, 13 Jul 2012 02:31:37 -0700 (PDT) From: Mahmood Naderan Subject: Re: [torqueusers] qsub doesn't find the executable file To: torque cluster Message-ID: <1342171897.79889.YahooMailNeo at web111722.mail.gq1.yahoo.com> Content-Type: text/plain; charset=iso-8859-1 ok, file command points the right direction. msim: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for NU/Linux 2.6.15, BuildID[sha1]=0xcf66490c90387ce270d14bf4f4e3a01294935a36, not stripped So the problem is that I want to run a 32bit binary on a amd64 os. Server has the package "ia32-libs" however I didn't install that on the nodes. Thanks a lot for your help. ? // Naderan *Mahmood; ----- Original Message ----- From: Philippe Weill To: Mahmood Naderan ; Torque Users Mailing List Cc: Sent: Friday, July 13, 2012 1:09 PM Subject: Re: [torqueusers] qsub doesn't find the executable file Le 13/07/2012 08:41, Mahmood Naderan a ?crit : > I think I found the problem but don't know how to fix that. > /home is shared via NFS. So I ssh to a node (other than server). > > u1 at orca:~/workspace$ ssh ws05 > Last login: Sat Jun? 4 14:22:08 2011 from hpclab.orca > u1 at ws05:~$ ls > workspace > > u1 at ws05:~$ cd workspace/ > u1 at ws05:~/workspace$ ls -l msim > -rwxr-xr-x 1 u1 users 81179 Jul 12? 2012 msim > > u1 at ws05:~/workspace$ ./msim > -bash: ./msim: No such file or directory > > > It seems that on other nodes, it can not execute! > > could you do a file ./msim -- ? Weill Philippe -? Administrateur Systeme et Reseaux ? CNRS/UPMC/IPSL? LATMOS (UMR 8190) ? Tour 45/46 3e Etage B302 - 4 Place Jussieu - 75252 Paris Cedex 05 -? FRANCE ? 
Email:philippe.weill at latmos.ipsl.fr | tel:+33 0144274759 Fax:+33 0144273776 ------------------------------ Message: 4 Date: Fri, 13 Jul 2012 02:15:56 +0000 From: Jinbianbian Subject: [torqueusers] torque qorder command do not work To: "torqueusers at supercluster.org" Cc: Zhongjianfeng Message-ID: Content-Type: text/plain; charset="us-ascii" Hi,all I am using torque3.0.2 and maui3.3.1 . I found the command "qorder" did not work. I also found the problem On the network.there were some people got the problems ,but I did not find any solution. What I want to know: 1. Does the problem can be solved in my cluster? and how? 2. Which version of torque and maui can work correctly with the command "qorder" Thank you -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120713/8d8b510b/attachment-0001.html ------------------------------ Message: 5 Date: Fri, 13 Jul 2012 18:31:22 +0000 From: "Andrus, Brian Contractor" Subject: Re: [torqueusers] TORQUE 4.0 and hwloc To: Torque Users Mailing List Message-ID: Content-Type: text/plain; charset="us-ascii" Hmm. I am trying to just build torque 4.0.1 using cpusets and I keep getting: =================================================== checking for HWLOC... configure: error: cpuset support requires the hwloc package Perhaps you should add the directory containing 'hwloc.pc' to the PKG_CONFIG_PATH environment variable. Alternatively, you may set the environment variables HWLOC_CFLAGS and HWLOC_LIBS before running configure. 
Example: export HWLOC_CFLAGS='-I/usr/local/hwloc-1.1/include' export HWLOC_LIBS='-L/usr/local/hwloc-1.1/lib -lhwloc' error: Bad exit status from /var/tmp/rpm-tmp.iowncz (%build) =================================================== I have the stock hwloc packages installed: [root at hamming SPECS]# rpm -qa |grep hwloc hwloc-devel-1.1-0.1.el6.i686 hwloc-1.1-0.1.el6.x86_64 hwloc-devel-1.1-0.1.el6.x86_64 hwloc-1.1-0.1.el6.i686 I have tried setting HWLOC_CFLAGS and HWLOC_LIBS to no avail. Is there something special to set when using the stock packages for CentOS 6.3? Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A Sent: Wednesday, April 04, 2012 12:07 PM To: Torque Users Mailing List Subject: Re: [torqueusers] TORQUE 4.0 and hwloc Yeah, when I passed the "-with cpuset" to the rpmbuild command then I got the following in my stdout from the actual rpm build process (from the configure part actually): checking whether to allow geometry requests... no checking whether to support NUMA systems... no checking for HWLOC... yes checking for hwloc_bitmap_alloc in -lhwloc... yes and then during the actual compile I see a -lhwloc being linked in. -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gabe Turner Sent: Wednesday, April 04, 2012 10:51 AM To: torqueusers at supercluster.org Subject: Re: [torqueusers] TORQUE 4.0 and hwloc On Wed, Apr 04, 2012 at 12:09:47PM -0500, Gabe Turner wrote: > It looks to me like the spec file is supporting the --with option to > rpmbuild. So cpuset will be enabled as a configure option if you pass > '--with cpuset' to rpmbuild. Is that what you are already trying? I just did this to build the RPMs with support for cpusets. 
Admittedly, it is a bit cumbersome, though perhaps only because I have hwloc installed in a centralized location and not from an RPM. gabe at node1084 [~/torque-4.0.1] % make rpm HWLOC_CFLAGS='-I/soft/hwloc/1.4.1/include' HWLOC_LIBS='-L/soft/hwloc/1.4.1/lib -lhwloc' RPM_AC_OPTS+='--with cpuset' gabe at node1084 [~/torque-4.0.1] % rpm -qRp ~/rpmbuild/RPMS/x86_64/torque-client-4.0.1-1.cri.x86_64.rpm torque = 4.0.1-1.cri . . . libhwloc.so.5()(64bit) . . . -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers ------------------------------ _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers End of torqueusers Digest, Vol 96, Issue 8 ****************************************** ------------------------------ _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers End of torqueusers Digest, Vol 96, Issue 9 ****************************************** From ramesh.kumar2 at wipro.com Wed Jul 18 06:54:10 2012 From: ramesh.kumar2 at wipro.com (ramesh.kumar2 at wipro.com) Date: Wed, 18 Jul 2012 12:54:10 +0000 Subject: [torqueusers] Torque Queue issue Message-ID: <611D228B63BCD64584F53F26EF82D2C50AAD0882@BLR-SJP-MBX-1.wipro.com> Hi, I am facing problem with Torque queue. I have configured 2 different queue as Q1 & Q2 and I have set properties in the node file as per requirement. 
I have set properties for some nodes as Q1 and some nodes as Q2, but when I submit jobs to a particular queue, they go to the default queue instead of that queue. I need your quick help to resolve the issue. Thanks, Ramesh Kumar Please do not print this email unless it is absolutely necessary. The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. www.wipro.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120718/587e7485/attachment.html
From knielson at adaptivecomputing.com Wed Jul 18 08:55:58 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 18 Jul 2012 08:55:58 -0600 Subject: [torqueusers] Torque Queue issue In-Reply-To: <611D228B63BCD64584F53F26EF82D2C50AAD0882@BLR-SJP-MBX-1.wipro.com> References: <611D228B63BCD64584F53F26EF82D2C50AAD0882@BLR-SJP-MBX-1.wipro.com> Message-ID: On Wed, Jul 18, 2012 at 6:54 AM, wrote: > Hi, > > I am facing problem with Torque queue. > > I have configured 2 different queue as Q1 & Q2 and I have set properties > in the node file as per requirement. > > I have set properties for some nodes as Q1 and some nodes as Q2 but when I > submit jobs to the particular queue. It is going to default queue instead > of going to particular queue. > > Need you quick help to resolve the issue. > > Thanks, > > Ramesh Kumar > > Ramesh, You need to create a third queue of type route. You then need to make the routing queue the default queue. You will probably need to add some other information to your other queues as well. Can you post your output from qmgr -c 'p s'? Regards Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120718/dae2d490/attachment.html
From fcaba at uns.edu.ar Wed Jul 18 09:57:01 2012 From: fcaba at uns.edu.ar (Fernando Caba) Date: Wed, 18 Jul 2012 12:57:01 -0300 Subject: [torqueusers] Moving jobs from one node to another Message-ID: <5006DCCD.9050808@uns.edu.ar> Hi, I want to know something about moving jobs from one node to another. If I need to do some maintenance on a node that has a number of running jobs (which cannot be killed), can I move all of those jobs (or specific ones) to another node (free or not)? If yes, how? Regards Fernando -- ---------------------------------------------------- Ing. Fernando Caba Director General de Telecomunicaciones Universidad Nacional del Sur http://www.dgt.uns.edu.ar Tel/Fax: (54)-291-4595166 Tel: (54)-291-4595101 int. 2050 Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina ---------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4533 bytes Desc: Firma criptográfica S/MIME Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120718/420a12f5/attachment.bin From jpeltier at sfu.ca Wed Jul 18 17:41:54 2012 From: jpeltier at sfu.ca (James A.
Peltier) Date: Wed, 18 Jul 2012 16:41:54 -0700 (PDT) Subject: [torqueusers] Torque Queue issue In-Reply-To: <611D228B63BCD64584F53F26EF82D2C50AAD0882@BLR-SJP-MBX-1.wipro.com> Message-ID: <285716704.17687457.1342654914840.JavaMail.root@jaguar10.sfu.ca> are you specifying -q q1 or -q q2? ----- Original Message ----- | Hi, | I am facing problem with Torque queue. | I have configured 2 different queue as Q1 & Q2 and I have set | properties in the node file as per requirement. | I have set properties for some nodes as Q1 and some nodes as Q2 but | when I submit jobs to the particular queue. It is going to default | queue instead of going to particular queue. | Need you quick help to resolve the issue. | Thanks, | Ramesh Kumar | Please do not print this email unless it is absolutely necessary. | The information contained in this electronic message and any | attachments to this message are intended for the exclusive use of | the addressee(s) and may contain proprietary, confidential or | privileged information. If you are not the intended recipient, you | should not disseminate, distribute or copy this e-mail. Please | notify the sender immediately and destroy all copies of this message | and any attachments. | WARNING: Computer viruses can be transmitted via email. The recipient | should check this email and any attachments for the presence of | viruses. The company accepts no liability for any damage caused by | any virus transmitted by this email. | www.wipro.com | _______________________________________________ | torqueusers mailing list | torqueusers at supercluster.org | http://www.supercluster.org/mailman/listinfo/torqueusers -- James A. 
Peltier Manager, IT Services - Research Computing Group Simon Fraser University - Burnaby Campus Phone : 778-782-6573 Fax : 778-782-3045 E-Mail : jpeltier at sfu.ca Website : http://www.sfu.ca/itservices http://blogs.sfu.ca/people/jpeltier Success is to be measured not so much by the position that one has reached in life but as by the obstacles they have overcome. - Booker T. Washington -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120718/4b863c17/attachment-0001.html From ramesh.kumar2 at wipro.com Thu Jul 19 00:21:14 2012 From: ramesh.kumar2 at wipro.com (ramesh.kumar2 at wipro.com) Date: Thu, 19 Jul 2012 06:21:14 +0000 Subject: [torqueusers] torqueusers Digest, Vol 96, Issue 11 In-Reply-To: References: Message-ID: <611D228B63BCD64584F53F26EF82D2C50AAD0AB8@BLR-SJP-MBX-1.wipro.com> Thank Ken. Please find the below qmgr output [root at mandhan scripts]# qmgr -c 'p s' # # Create queues and set their attributes. 
# # # Create and define queue Q3 # create queue Q3 set queue Q3 queue_type = Execution set queue Q3 max_running = 3 set queue Q3 resources_default.neednodes = Q3 set queue Q3 resources_default.walltime = 08:00:00 set queue Q3 resources_available.nodect = 75 set queue Q3 enabled = True set queue Q3 started = True # # Create and define queue Q2 # create queue Q2 set queue Q2 queue_type = Execution set queue Q2 max_running = 11 set queue Q2 resources_default.neednodes = Q2 set queue Q2 resources_default.walltime = 08:00:00 set queue Q2 resources_available.nodect = 55 set queue Q2 enabled = True set queue Q2 started = True # # Create and define queue test # create queue test set queue test queue_type = Execution set queue test acl_host_enable = False set queue test acl_hosts = mandhan set queue test resources_default.neednodes = infiniband set queue test resources_default.nodes = 1 set queue test resources_default.walltime = 24:00:00 set queue test enabled = True set queue test started = True # # Create and define queue Q1 # create queue Q1 set queue Q1 queue_type = Execution set queue Q1 max_running = 9 set queue Q1 resources_default.neednodes = Q1 set queue Q1 resources_default.walltime = 08:00:00 set queue Q1 resources_available.nodect = 44 set queue Q1 enabled = True set queue Q1 started = True # # Create and define queue batch # create queue batch set queue batch queue_type = Execution set queue batch resources_default.nodes = 1 set queue batch resources_default.walltime = 13:00:00 set queue batch enabled = True set queue batch started = True # # Set server attributes. 
# set server scheduling = True set server acl_hosts = sn1.caos.iisc.ernet.in set server acl_hosts += mandhan.caos.iisc.ernet.in set server default_queue = Q1 set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server mom_job_sync = True set server keep_completed = 300 set server submit_hosts = +sn1.caos.iisc.ernet.in set server auto_node_np = True set server next_job_number = 361 [root at mandhan scripts]# Rgds, Ramesh
From sm4082 at nyu.edu Thu Jul 19 05:52:02 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Thu, 19 Jul 2012 07:52:02 -0400 Subject: [torqueusers] Torque Queue issue In-Reply-To: <611D228B63BCD64584F53F26EF82D2C50AAD0882@BLR-SJP-MBX-1.wipro.com> References: <611D228B63BCD64584F53F26EF82D2C50AAD0882@BLR-SJP-MBX-1.wipro.com> Message-ID: Hi Ramesh, Did you try with #PBS - l feature='property' Sreedhar. On Jul 18, 2012 10:53 AM, wrote: -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120719/049a5726/attachment.html From knielson at adaptivecomputing.com Thu Jul 19 10:49:35 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 19 Jul 2012 10:49:35 -0600 Subject: [torqueusers] torqueusers Digest, Vol 96, Issue 11 In-Reply-To: <611D228B63BCD64584F53F26EF82D2C50AAD0AB8@BLR-SJP-MBX-1.wipro.com> References: <611D228B63BCD64584F53F26EF82D2C50AAD0AB8@BLR-SJP-MBX-1.wipro.com> Message-ID: On Thu, Jul 19, 2012 at 12:21 AM, wrote: > Thank Ken. Please find the below qmgr output > > [root at mandhan scripts]# qmgr -c 'p s' > # > # Create queues and set their attributes. > # > # > # Create and define queue Q3 > # > create queue Q3 > set queue Q3 queue_type = Execution > set queue Q3 max_running = 3 > set queue Q3 resources_default.neednodes = Q3 > set queue Q3 resources_default.walltime = 08:00:00 > set queue Q3 resources_available.nodect = 75 > set queue Q3 enabled = True > set queue Q3 started = True > # > # Create and define queue Q2 > # > create queue Q2 > set queue Q2 queue_type = Execution > set queue Q2 max_running = 11 > set queue Q2 resources_default.neednodes = Q2 > set queue Q2 resources_default.walltime = 08:00:00 > set queue Q2 resources_available.nodect = 55 > set queue Q2 enabled = True > set queue Q2 started = True > # > # Create and define queue test > # > create queue test > set queue test queue_type = Execution > set queue test acl_host_enable = False > set queue test acl_hosts = mandhan > set queue test resources_default.neednodes = infiniband > set queue test resources_default.nodes = 1 > set queue test resources_default.walltime = 24:00:00 > set queue test enabled = True > set queue test started = True > # > # Create and define queue Q1 > # > create queue Q1 > set queue Q1 queue_type = Execution > set queue Q1 max_running = 9 > set queue Q1 resources_default.neednodes = Q1 > set queue Q1 resources_default.walltime = 08:00:00 > set queue Q1 
resources_available.nodect = 44 > set queue Q1 enabled = True > set queue Q1 started = True > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 13:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = sn1.caos.iisc.ernet.in > set server acl_hosts += mandhan.caos.iisc.ernet.in > set server default_queue = Q1 > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server submit_hosts = +sn1.caos.iisc.ernet.in > set server auto_node_np = True > set server next_job_number = 361 > [root at mandhan scripts]# > > Rgds, > Ramesh > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of > torqueusers-request at supercluster.org > Sent: Thursday, July 19, 2012 5:12 AM > To: torqueusers at supercluster.org > Subject: torqueusers Digest, Vol 96, Issue 11 > > Send torqueusers mailing list submissions to > torqueusers at supercluster.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.supercluster.org/mailman/listinfo/torqueusers > or, via email, send a message with subject or body 'help' to > torqueusers-request at supercluster.org > > You can reach the person managing the list at > torqueusers-owner at supercluster.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of torqueusers digest..." > > > Today's Topics: > > 1. Torque Queue issue (ramesh.kumar2 at wipro.com) > 2. Re: Torque Queue issue (Ken Nielson) > 3. Moving jobs from one node to another (Fernando Caba) > 4. 
Re: Torque Queue issue (James A. Peltier) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 18 Jul 2012 12:54:10 +0000 > From: > Subject: [torqueusers] Torque Queue issue > To: > Message-ID: > <611D228B63BCD64584F53F26EF82D2C50AAD0882 at BLR-SJP-MBX-1.wipro.com> > Content-Type: text/plain; charset="us-ascii" > > Hi, > > I am facing problem with Torque queue. > > I have configured 2 different queue as Q1 & Q2 and I have set properties > in the node file as per requirement. > > I have set properties for some nodes as Q1 and some nodes as Q2 but when I > submit jobs to the particular queue. It is going to default queue instead > of going to particular queue. > > Need you quick help to resolve the issue. > > Thanks, > Ramesh Kumar > > > Ramesh, Check out these links about creating queues. This one points to how to create and configure queues in general. http://www.adaptivecomputing.com/resources/docs/torque/2-5-12/help.htm#topics/4-serverPolicies/queueConfig.htm This one points to how to create routing queues. http://www.adaptivecomputing.com/resources/docs/torque/2-5-12/help.htm#topics/4-serverPolicies/creatingRoutingQueue.htm Regards Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120719/03064640/attachment-0001.html From lloyd_brown at byu.edu Thu Jul 19 11:10:56 2012 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Thu, 19 Jul 2012 11:10:56 -0600 Subject: [torqueusers] Torque 2.5.x->4.x upgrade finding already-running jobs? Message-ID: <50083FA0.6060408@byu.edu> Hi, all. We're considering upgrading our production cluster from Torque 2.5.9 to 4.1.0, and wondered if I could ask a question, to tap into the community's expertise. I know that, due to communication-protocol changes, we need to upgrade the pbs_server and pbs_mom's together. 
But is there any known issue with the 4.x pbs_mom's picking up on the already-running jobs (started by the 2.5.x pbs_mom)? Obviously the pbs_mom process won't be the parent process of the job, but that shouldn't be any different than restarting with the "-p" option anyway. I'll do some testing too, but I just thought I'd ask around to see if anyone has done it already, and what success/failures they've had. If it works, it could save us most of a full-system outage, which is a very big deal. -- Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu From ytt515 at yahoo.cn Mon Jul 23 02:34:37 2012 From: ytt515 at yahoo.cn (TingtingYang) Date: Mon, 23 Jul 2012 16:34:37 +0800 (CST) Subject: [torqueusers] checkpoint/restart mpi-job on different compute nodes Message-ID: <1343032477.22496.YahooMailClassic@web92205.mail.cnh.yahoo.com> Hi all, I want to use Torque's checkpoint/restart function. I wonder if it is possible to checkpoint/restart MPI jobs that run on multiple nodes with -l nodes=2:ppn=2. Right now I can checkpoint/restart MPI jobs that run on one node with -l nodes=1:ppn=4, but I encounter some errors when I try to checkpoint/restart MPI jobs running on multiple nodes. Thank you. Tingting Yang, from Beihang University -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120723/9c8090e8/attachment.html From taras.shapovalov at brightcomputing.com Mon Jul 23 02:38:46 2012 From: taras.shapovalov at brightcomputing.com (Taras Shapovalov) Date: Mon, 23 Jul 2012 10:38:46 +0200 Subject: [torqueusers] pbs_sched listening interface In-Reply-To: <50040433.6040104@brightcomputing.com> References: <50040433.6040104@brightcomputing.com> Message-ID: <500D0D96.6030001@brightcomputing.com> No ideas?
On 07/16/2012 02:08 PM, Taras Shapovalov wrote: > Hi all, > > Is it possible to change interface, which pbs_sched is listening to? > It looks like pbs_sched does not use TRQ_IFNAME from torque.cfg. > -- Taras From SAngelovich at lgc.com Tue Jul 24 21:51:09 2012 From: SAngelovich at lgc.com (Steve Angelovich) Date: Tue, 24 Jul 2012 22:51:09 -0500 Subject: [torqueusers] xpbsmon error In-Reply-To: References: Message-ID: <500F6D2D.4070404@lgc.com> We just updated to a centos 6.x head node and torque 2.5.9. When running xpbsmon from the head node we are getting the error below. When running on an older OS we are not getting an error. Is there a fix for this? Thanks, Steve bad variable name "pref.top.box1(entryval,1)": upvar won't create a scalar variable that looks like an array element bad variable name "pref.top.box1(entryval,1)": upvar won't create a scalar variable that looks like an array element while executing "global [set [set menuName](textvariable)]" (procedure "menuEntry" line 50) invoked from within "menuEntry $f.b.e.$k create -menuvalues $args -title $elabel -textvariable [set boxName](entryval,$k)" (procedure "box" line 263) invoked from within "box $dbox_top.box1 -title "Sites Preference" -entrylabels [list [list "Site Name" "" ""] [list "View" MENU_ENTRY "ICON" "FULL"]] -lboxlabels [list ..." (procedure "pref" line 16) invoked from within "pref $dialog(mainWindow) $dialog(mainWindow)" invoked from within ".main.menubar.buttons.1.prefer invoke" ("uplevel" body line 1) invoked from within "uplevel #0 [list $w invoke]" (procedure "tk::ButtonUp" line 22) invoked from within "tk::ButtonUp .main.menubar.buttons.1.prefer" (command bound to event) ---------------------------------------------------------------------- This e-mail, including any attached files, may contain confidential and privileged information for the sole use of the intended recipient. Any review, use, distribution, or disclosure by others is strictly prohibited. 
If you are not the intended recipient (or authorized to receive information for the intended recipient), please contact the sender by reply e-mail and delete all copies of this message.

From yotama9 at gmail.com Tue Jul 24 02:34:16 2012 From: yotama9 at gmail.com (Yotam Avital) Date: Tue, 24 Jul 2012 11:34:16 +0300 Subject: [torqueusers] no C++ outout files Message-ID:

I'm running some computer simulations on a PBS machine (I'm not sure if it's PBS or Torque). The programs are supposed to generate/overwrite a file every 1000 cycles, which should happen several times a day. However, no output file is generated. This output file is essential for my work. How can I generate the file? Thanks.

In case this is the issue, here is how I print data to the file:

    this->file.open("FTout.dat",std::ios::trunc);
    int q;
    int I;
    int J;
    int L;
    file << "q\th\n";
    for (int i = 0; i < P.FTsize*0.5; i++){
        for (int j = 0; j < P.FTsize; j++){
            J = j - 4;
            if ((i == 0 && J > 0) ||(i > 0)){
                q = int(pow(i,2) + pow(J,2));
                if(q <= 13){
                    file << q << "\t" << ft.ffs[i][j][i][j].size << "\n" ;
                }
            }
        }
    }
    this->file.close();

Where file is defined in the header:

    ofstream file;

-- My other email account has a "professional" signature. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120724/f7f9629a/attachment-0001.html

From yjdvsroger at yahoo.com.cn Tue Jul 24 06:29:30 2012 From: yjdvsroger at yahoo.com.cn (=?gb2312?B?vNG2sCDOtQ==?=) Date: Tue, 24 Jul 2012 20:29:30 +0800 (CST) Subject: [torqueusers] Issues with Parallel Job Submissions in PBS Message-ID: <1343132970.7874.YahooMailClassic@web15003.mail.cnb.yahoo.com>

Hi, I am a newbie! I have a two nodes PBS system, one with pbs_server, pbs_mom, pbs_sched, the other with pbs_mom. I can submit serial job to both nodes, and when I submit a parallel job, it went into the queue and ready to run, but never executed.
Attached is relevant information about configurations, error messages, etc. Any help is greatly appreciated :-) Regards, Lancelot -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120724/27ff4858/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: new file Type: application/octet-stream Size: 15926 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120724/27ff4858/attachment-0001.obj From brockp at umich.edu Wed Jul 25 09:29:34 2012 From: brockp at umich.edu (Brock Palen) Date: Wed, 25 Jul 2012 11:29:34 -0400 Subject: [torqueusers] osc mpiexec and torque4 Message-ID: The OSC mpiexec appears to have issues with torque 4.1.0 but works fine with 2.x Has anyone gotten mpiexec (the popular tm aware launcher for mpich2 and mvapich) to work with torque 4? I have some debugging information below: [brockp at nyx7000 ~]$ /home/software/rhel6/mpiexec/bin/mpiexec -v -v -v ~/a.out mpiexec: stat_exe: testing "/home/brockp/a.out". mpiexec: resolve_exe: using absolute path "/home/brockp/a.out". mpiexec: stdio_notice_streams: aggregate = 0 1 2. mpiexec: concurrent_init: unix socket exists, trying to connect. mpiexec: concurrent_init: old master died, reusing his fifo as master. mpiexec: concurrent_init: i am concurrent master. 
Segmentation fault

(gdb) where
#0 0x00000036afd31aff in __strlen_sse42 () from /lib64/libc.so.6
#1 0x00002aaaaaac53af in pbs_connect (server_name_ptr=0x0) at ../Libifl/pbsD_connect.c:1256
#2 0x0000000000405170 in get_hosts () at get_hosts.c:98
#3 0x0000000000403601 in main (argc=1, argv=0x7fffffffd890) at mpiexec.c:700

Line 1256 of pbsD_connect.c is:

    strncat(server_name_list, pbs_get_server_list(),
            sizeof(server_name_list) -1 - strlen(server_name_ptr) - 1);

Examining server_name_list and server_name_ptr I get interesting results:

(gdb) x server_name_list
0x7fffffffc5f0: 0x00000000
(gdb) printf "%s", server_name_list
(nothing returned by gdb)

(gdb) x server_name_ptr
0x0: Cannot access memory at address 0x0

The empty string in server_name_list and the "Cannot access memory" message look strange to me, but I am not sure.

Brock Palen www.umich.edu/~brockp CAEN Advanced Computing brockp at umich.edu (734)936-1985

From philippe.Weill at latmos.ipsl.fr Wed Jul 25 09:41:10 2012 From: philippe.Weill at latmos.ipsl.fr (Philippe Weill) Date: Wed, 25 Jul 2012 17:41:10 +0200 Subject: [torqueusers] Issues with Parallel Job Submissions in PBS In-Reply-To: <1343132970.7874.YahooMailClassic@web15003.mail.cnb.yahoo.com> References: <1343132970.7874.YahooMailClassic@web15003.mail.cnb.yahoo.com> Message-ID: <50101396.2060208@latmos.ipsl.fr>

On 24/07/2012 14:29, ?? ? wrote: > Hi, > > I am a newbie! > > I have a two nodes PBS system, one with pbs_server, pbs_mom, pbs_sched, the other with pbs_mom. > I can submit serial job to both nodes, and when I submit a parallel job, it went into the queue and ready to run, but never executed. > > Attached is relevant information about configurations, error messages, etc.
Any help is greatly appreciated :-) > > Regards, > Lancelot seem for me you need set server node_pack = False to use nodes=x:ppn=y -- Weill Philippe - Administrateur Systeme et Reseaux CNRS/UPMC/IPSL LATMOS (UMR 8190) Tour 45/46 3e Etage B302 - 4 Place Jussieu - 75252 Paris Cedex 05 - FRANCE Email:philippe.weill at latmos.ipsl.fr | tel:+33 0144274759 Fax:+33 0144273776 From mej at lbl.gov Wed Jul 25 10:06:14 2012 From: mej at lbl.gov (Michael Jennings) Date: Wed, 25 Jul 2012 09:06:14 -0700 Subject: [torqueusers] osc mpiexec and torque4 In-Reply-To: References: Message-ID: <20120725160613.GF5670@lbl.gov> On Wednesday, 25 July 2012, at 11:29:34 (-0400), Brock Palen wrote: > The OSC mpiexec appears to have issues with torque 4.1.0 but works fine with 2.x > > Has anyone gotten mpiexec (the popular tm aware launcher for mpich2 and mvapich) to work with torque 4? > > I have some debugging information below: > > [brockp at nyx7000 ~]$ /home/software/rhel6/mpiexec/bin/mpiexec -v -v -v ~/a.out > mpiexec: stat_exe: testing "/home/brockp/a.out". > mpiexec: resolve_exe: using absolute path "/home/brockp/a.out". > mpiexec: stdio_notice_streams: aggregate = 0 1 2. > mpiexec: concurrent_init: unix socket exists, trying to connect. > mpiexec: concurrent_init: old master died, reusing his fifo as master. > mpiexec: concurrent_init: i am concurrent master. > Segmentation fault > > > (gdb) where > #0 0x00000036afd31aff in __strlen_sse42 () from /lib64/libc.so.6 > #1 0x00002aaaaaac53af in pbs_connect (server_name_ptr=0x0) at ../Libifl/pbsD_connect.c:1256 > #2 0x0000000000405170 in get_hosts () at get_hosts.c:98 > #3 0x0000000000403601 in main (argc=1, argv=0x7fffffffd890) at mpiexec.c:700 > > > Line 1256 of pbsD_connect.c is: > strncat(server_name_list, pbs_get_server_list(), > sizeof(server_name_list) -1 - strlen(server_name_ptr) - 1); You can try changing the strlen() call to: ((server_name_ptr) ? 
(strlen(server_name_ptr)) : (0))

but that won't fix the ultimate problem of an invalid server name being passed in by mpiexec. (It will, however, make libtorque more robust and should probably be done upstream.)

In fact, I'd change that whole section of code:

    /* Use the list from the server_name file.
     * If a server name is passed in, append it at the beginning. */

    if (server_name_ptr && server_name_ptr[0])
      {
      snprintf(server_name_list, sizeof(server_name_list), "%s,%s",
               server_name_ptr, pbs_get_server_list());
      }
    else
      {
      strncat(server_name_list, pbs_get_server_list(),
              sizeof(server_name_list) - 1);
      }

    if (getenv("PBSDEBUG"))
      fprintf(stderr, "pbs_connect using following server list \"%s\"\n",
              server_name_list);

> Examining server_name_list and server_name_ptr I get interesting results:
>
> (gdb) x server_name_list
> 0x7fffffffc5f0: 0x00000000
> (gdb) printf "%s", server_name_list
> (nothing returned by gdb)
>
> (gdb) x server_name_ptr
> 0x0: Cannot access memory at address 0x0
>
> The empty string of server_name_list and the cannot access memory
> appear strange to me, but I am not sure.

There's at least one bug in terms of lack of robustness in not handling the case where server_name_ptr is NULL. However, there's another problem somewhere regarding why it's NULL to begin with. Have you looked at the get_hosts() code (in mpiexec) at all? That's probably where I'd start. Once you/we figure out why that variable is being passed to pbs_connect() as NULL, we should have a better idea what's going awry.
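For anyone who wants to try the NULL-safe logic outside libtorque, here is a minimal standalone sketch in C. It is an illustration under stated assumptions, not the TORQUE source: build_server_list() and its parameters are hypothetical names, and default_list merely stands in for whatever pbs_get_server_list() would return.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, NOT part of libtorque. It builds the comma-separated
 * list of servers to try. A NULL or empty server_name means "use the
 * default list only", so strlen() is never called on a NULL pointer. */
void build_server_list(char *dst, size_t dstsize,
                       const char *server_name,
                       const char *default_list)
{
    assert(dst != NULL && dstsize > 0);

    if (default_list == NULL)
        default_list = "";

    if (server_name != NULL && server_name[0] != '\0') {
        if (default_list[0] != '\0') {
            /* Explicit server first, then the defaults. */
            snprintf(dst, dstsize, "%s,%s", server_name, default_list);
        } else {
            /* No defaults available: only the explicit server. */
            snprintf(dst, dstsize, "%s", server_name);
        }
    } else {
        /* NULL or "" was passed in: fall back to the default list. */
        snprintf(dst, dstsize, "%s", default_list);
    }
}
```

Calling build_server_list(buf, sizeof(buf), NULL, "srv1,srv2") leaves "srv1,srv2" in buf, which is the "NULL or empty string means default server" case the crash above trips over. snprintf() is used in both branches so the result is always NUL-terminated and never depends on the previous contents of the buffer.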
HTH, Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From djohnson at osc.edu Wed Jul 25 10:48:54 2012 From: djohnson at osc.edu (Doug Johnson) Date: Wed, 25 Jul 2012 12:48:54 -0400 Subject: [torqueusers] osc mpiexec and torque4 In-Reply-To: <20120725160613.GF5670@lbl.gov> References: <20120725160613.GF5670@lbl.gov> Message-ID: At Wed, 25 Jul 2012 09:06:14 -0700, Michael Jennings wrote: > > On Wednesday, 25 July 2012, at 11:29:34 (-0400), > Brock Palen wrote: > > > The OSC mpiexec appears to have issues with torque 4.1.0 but works fine with 2.x > > > > Has anyone gotten mpiexec (the popular tm aware launcher for mpich2 and mvapich) to work with torque 4? > > > > I have some debugging information below: > > > > [brockp at nyx7000 ~]$ /home/software/rhel6/mpiexec/bin/mpiexec -v -v -v ~/a.out > > mpiexec: stat_exe: testing "/home/brockp/a.out". > > mpiexec: resolve_exe: using absolute path "/home/brockp/a.out". > > mpiexec: stdio_notice_streams: aggregate = 0 1 2. > > mpiexec: concurrent_init: unix socket exists, trying to connect. > > mpiexec: concurrent_init: old master died, reusing his fifo as master. > > mpiexec: concurrent_init: i am concurrent master. > > Segmentation fault > > > > > > (gdb) where > > #0 0x00000036afd31aff in __strlen_sse42 () from /lib64/libc.so.6 > > #1 0x00002aaaaaac53af in pbs_connect (server_name_ptr=0x0) at ../Libifl/pbsD_connect.c:1256 > > #2 0x0000000000405170 in get_hosts () at get_hosts.c:98 > > #3 0x0000000000403601 in main (argc=1, argv=0x7fffffffd890) at mpiexec.c:700 > > > > > > Line 1256 of pbsD_connect.c is: > > strncat(server_name_list, pbs_get_server_list(), > > sizeof(server_name_list) -1 - strlen(server_name_ptr) - 1); > > You can try changing the strlen() call to: > > ((server_name_ptr) ? 
(strlen(server_name_ptr)) : (0)) > > but that won't fix the ultimate problem of an invalid server name > being passed in by mpiexec. (It will, however, make libtorque more > robust and should probably be done upstream.) > > In fact, I'd change that whole section of code: > > /* Use the list from the server_name file. > * If a server name is passed in, append it at the beginning. */ > > if (server_name_ptr && server_name_ptr[0]) > { > snprintf(server_name_list, sizeof(server_name_list), "%s,%s", > server_name_ptr, pbs_get_server_list()); > } > else > { > strncat(server_name_list, pbs_get_server_list(), > sizeof(server_name_list) - 1); > } > > if (getenv("PBSDEBUG")) > fprintf(stderr, "pbs_connect using following server list \"%s\"\n", > server_name_list); > > > Examining server_name_list and server_name_ptr I get interesting results: > > > > (gdb) x server_name_list > > 0x7fffffffc5f0: 0x00000000 > > (gdb) printf "%s", server_name_list > > (nothing returned by gdb) > > > > (gdb) x server_name_ptr > > 0x0: Cannot access memory at address 0x0 > > > > The empty string of server_name_list and the cannot access memory > > appear strange to me, but I am not sure. > > There's at least one bug in terms of lack of robustness in not > handling the case where server_name_ptr is NULL. However, there's > another problem somewhere regarding why it's NULL to begin with. Have > you looked at the get_hosts() code (in mpiexec) at all? That's > probably where I'd start. Once you/we figure out why that variable is > being passed to pbs_connect() as NULL, we should have a better idea > what's going awry. > Has pbs_connect changed in torque 4? From the man page, If the parameter, server, is either the null string or a null pointer, a connection will be opened to the default server. 
The default server is defined by (a) the setting of the environment variable PBS_DEFAULT which contains a destination, or (b) the desti- nation in the batch administrator established file {PBS_DIR}/default_destn. Either something is wrong in Brock's environment, or pbs_connect does not work the same in torque 4.0. I agree that more error checking in pbs_connect on the client side is probably needed. Here are the relevant lines from mpiexec, /* * Now go talk to PBS. Get the hostnames in the job and compress it * down to our idea of nodes, matching up against the tasklist as we go. */ fd = pbs_connect(0); if (fd < 0) error_pbs("%s: pbs_connect", __func__); The pbs_connect succeeds. I'm not sure what more error checking could be done in mpiexec. Doug From mej at lbl.gov Wed Jul 25 11:01:16 2012 From: mej at lbl.gov (Michael Jennings) Date: Wed, 25 Jul 2012 10:01:16 -0700 Subject: [torqueusers] osc mpiexec and torque4 In-Reply-To: References: <20120725160613.GF5670@lbl.gov> Message-ID: <20120725170115.GJ5670@lbl.gov> On Wednesday, 25 July 2012, at 12:48:54 (-0400), Doug Johnson wrote: > Has pbs_connect changed in torque 4? It has, indeed. :-) > From the man page, > > If the parameter, server, is either the null string or a null > pointer, a connection will be opened to the default server. The > default server is defined by (a) the setting of the environment > variable PBS_DEFAULT which contains a destination, or (b) the desti- > nation in the batch administrator established file > {PBS_DIR}/default_destn. Ah, okay, sounds like mpiexec is relying on defined behavior that TORQUE is no longer properly handling. I wrongly assumed passing NULL was an error; sorry! :-] Comparing the code from the 2.5 branch and the 4.x branch, it looks like someone was trying to clean up/consolidate some code and overlooked the NULL/empty string case. The change I recommended in my previous e-mail should handle this just fine. Brock, you may want to open a bug for this against TORQUE 4. 
Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From lloyd_brown at byu.edu Wed Jul 25 11:19:42 2012 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Wed, 25 Jul 2012 11:19:42 -0600 Subject: [torqueusers] osc mpiexec and torque4 In-Reply-To: <20120725170115.GJ5670@lbl.gov> References: <20120725160613.GF5670@lbl.gov> <20120725170115.GJ5670@lbl.gov> Message-ID: <50102AAE.2010204@byu.edu> I've heard through the grapevine that there's a torque release with a number of bugfixes, due out at the end of July. Maybe if you hurry, you can get this in. My guess is that, with less than a week 'till then, that'll be hard, but you can try, right? Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 07/25/2012 11:01 AM, Michael Jennings wrote: > On Wednesday, 25 July 2012, at 12:48:54 (-0400), > Doug Johnson wrote: > >> Has pbs_connect changed in torque 4? > > It has, indeed. :-) > >> From the man page, >> >> If the parameter, server, is either the null string or a null >> pointer, a connection will be opened to the default server. The >> default server is defined by (a) the setting of the environment >> variable PBS_DEFAULT which contains a destination, or (b) the desti- >> nation in the batch administrator established file >> {PBS_DIR}/default_destn. > > Ah, okay, sounds like mpiexec is relying on defined behavior that > TORQUE is no longer properly handling. I wrongly assumed passing NULL > was an error; sorry! :-] > > Comparing the code from the 2.5 branch and the 4.x branch, it looks > like someone was trying to clean up/consolidate some code and > overlooked the NULL/empty string case. > > The change I recommended in my previous e-mail should handle this just > fine. Brock, you may want to open a bug for this against TORQUE 4. 
> > Michael > From brockp at umich.edu Wed Jul 25 12:15:29 2012 From: brockp at umich.edu (Brock Palen) Date: Wed, 25 Jul 2012 14:15:29 -0400 Subject: [torqueusers] osc mpiexec and torque4 In-Reply-To: <50102AAE.2010204@byu.edu> References: <20120725160613.GF5670@lbl.gov> <20120725170115.GJ5670@lbl.gov> <50102AAE.2010204@byu.edu> Message-ID: I actually tried with PBS_SERVER defined and not defined. In both cases the same problem. I modified mpiexec to change fd = pbs_connect("our server host") ; And this fixes the problem for our immediate needs (matlab+mvapich+tm support), Obviously this is not acceptable as a long term solution. I will look at filing a bug with adaptive about this behavior change. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing brockp at umich.edu (734)936-1985 On Jul 25, 2012, at 1:19 PM, Lloyd Brown wrote: > I've heard through the grapevine that there's a torque release with a > number of bugfixes, due out at the end of July. Maybe if you hurry, you > can get this in. > > My guess is that, with less than a week 'till then, that'll be hard, but > you can try, right? > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 07/25/2012 11:01 AM, Michael Jennings wrote: >> On Wednesday, 25 July 2012, at 12:48:54 (-0400), >> Doug Johnson wrote: >> >>> Has pbs_connect changed in torque 4? >> >> It has, indeed. :-) >> >>> From the man page, >>> >>> If the parameter, server, is either the null string or a null >>> pointer, a connection will be opened to the default server. The >>> default server is defined by (a) the setting of the environment >>> variable PBS_DEFAULT which contains a destination, or (b) the desti- >>> nation in the batch administrator established file >>> {PBS_DIR}/default_destn. >> >> Ah, okay, sounds like mpiexec is relying on defined behavior that >> TORQUE is no longer properly handling. I wrongly assumed passing NULL >> was an error; sorry! 
:-] >> >> Comparing the code from the 2.5 branch and the 4.x branch, it looks >> like someone was trying to clean up/consolidate some code and >> overlooked the NULL/empty string case. >> >> The change I recommended in my previous e-mail should handle this just >> fine. Brock, you may want to open a bug for this against TORQUE 4. >> >> Michael >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From mej at lbl.gov Wed Jul 25 13:41:29 2012 From: mej at lbl.gov (Michael Jennings) Date: Wed, 25 Jul 2012 12:41:29 -0700 Subject: [torqueusers] osc mpiexec and torque4 In-Reply-To: References: <20120725160613.GF5670@lbl.gov> <20120725170115.GJ5670@lbl.gov> <50102AAE.2010204@byu.edu> Message-ID: <20120725194128.GK5670@lbl.gov> On Wednesday, 25 July 2012, at 14:15:29 (-0400), Brock Palen wrote: > I actually tried with PBS_SERVER defined and not defined. In both cases the same problem. > > I modified mpiexec to change > > fd = pbs_connect("our server host") ; > > And this fixes the problem for our immediate needs (matlab+mvapich+tm support), > > Obviously this is not acceptable as a long term solution. > I will look at filing a bug with adaptive about this behavior change. Did you happen to try the change I recommended to TORQUE? AFAICT, that's the correct long-term fix. 
Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From brockp at umich.edu Wed Jul 25 14:19:10 2012 From: brockp at umich.edu (Brock Palen) Date: Wed, 25 Jul 2012 16:19:10 -0400 Subject: [torqueusers] osc mpiexec and torque4 In-Reply-To: <20120725194128.GK5670@lbl.gov> References: <20120725160613.GF5670@lbl.gov> <20120725170115.GJ5670@lbl.gov> <50102AAE.2010204@byu.edu> <20120725194128.GK5670@lbl.gov> Message-ID: I did not attempt the change to torque because I need to get things moving to production asap for us. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing brockp at umich.edu (734)936-1985 On Jul 25, 2012, at 3:41 PM, Michael Jennings wrote: > On Wednesday, 25 July 2012, at 14:15:29 (-0400), > Brock Palen wrote: > >> I actually tried with PBS_SERVER defined and not defined. In both cases the same problem. >> >> I modified mpiexec to change >> >> fd = pbs_connect("our server host") ; >> >> And this fixes the problem for our immediate needs (matlab+mvapich+tm support), >> >> Obviously this is not acceptable as a long term solution. >> I will look at filing a bug with adaptive about this behavior change. > > Did you happen to try the change I recommended to TORQUE? AFAICT, > that's the correct long-term fix. 
> > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jwilkinson at stoneeagle.com Wed Jul 25 12:19:34 2012 From: jwilkinson at stoneeagle.com (Jack Wilkinson) Date: Wed, 25 Jul 2012 18:19:34 +0000 Subject: [torqueusers] More appropriate list Message-ID: <00051F5C670B8444B35CB2B31B9B1D090C4987D4@se-ex2.stoneeagle.com> I've some questions about the actual use of various PBS commands as opposed to issues with the configuration of the system. Is there a more appropriate list I should subscribe to? Regards, Jack Wilkinson, Programmer Services | VPay(r) P: 972.367-6622 jwilkinson at stoneeagle.com www.stoneeagle.com www.vpayusa.com 111 W. Spring Valley Rd., #100 Richardson, TX 75081 CONFIDENTIALITY NOTICE: This email, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure, or distribution is prohibited. If you received this email and are not the intended recipient, please inform the sender by email reply and destroy all copies of the original message. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120725/36ba9fc1/attachment.html

From glen.beane at gmail.com Wed Jul 25 16:47:54 2012
From: glen.beane at gmail.com (Glen Beane)
Date: Wed, 25 Jul 2012 18:47:54 -0400
Subject: [torqueusers] More appropriate list
In-Reply-To: <00051F5C670B8444B35CB2B31B9B1D090C4987D4@se-ex2.stoneeagle.com>
References: <00051F5C670B8444B35CB2B31B9B1D090C4987D4@se-ex2.stoneeagle.com>
Message-ID:

I think the use of TORQUE is a valid topic for the torque users mailing list.

On Wed, Jul 25, 2012 at 2:19 PM, Jack Wilkinson wrote:
> I've some questions about the actual use of various PBS commands as opposed
> to issues with the configuration of the system. Is there a more appropriate
> list I should subscribe to?
>
> Regards,
>
> Jack Wilkinson, Programmer
> Services | VPay(r)
> P: 972.367-6622
> jwilkinson at stoneeagle.com
> www.stoneeagle.com
> www.vpayusa.com
>
> 111 W. Spring Valley Rd., #100
> Richardson, TX 75081
>
> CONFIDENTIALITY NOTICE: This email, including any attachments, is for the
> sole use of the intended recipient(s) and may contain confidential and
> privileged information. Any unauthorized review, use, disclosure, or
> distribution is prohibited. If you received this email and are not the
> intended recipient, please inform the sender by email reply and destroy all
> copies of the original message.
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From yjdvsroger at yahoo.com.cn Thu Jul 26 00:06:41 2012
From: yjdvsroger at yahoo.com.cn (=?gb2312?B?vNG2sCDOtQ==?=)
Date: Thu, 26 Jul 2012 14:06:41 +0800 (CST)
Subject: [torqueusers] Issues with Parallel Job Submissions in PBS
Message-ID: <1343282801.98077.YahooMailClassic@web15007.mail.cnb.yahoo.com>

I have found the problem, thx! I failed to install MPI into the same path!
followed my configuration and error message . EMAILS: PBS JOB ID:76.lancelot-laptop [lancelot at cfa ~]$ cat job3.pbs #!/bin/bash #PBS -N job3 #PBS -o job3.log #PBS -e job3.err #PBS -q sai #PBA -I #PBS -l nodes=2:ppn=2 #PBS -l walltime=24:00:00 #PBS -l cput=1:00:00 #PBS -V cd /home/lancelot echo running on hosts `hostname` echo time is `date` echo directory is $PWD echo job runs on the nodes: cat $PBS_NODEFILE NPROCS=`wc -l < $PBS_NODEFILE` echo this job has allocated $NPROCS nodes mpiexec -np 4 ./prog [lancelot at cfa ~]$ cat prog #!/bin/bash echo 999999999|./icpi root at lancelot-laptop:/home/lancelot# pbsnodes lancelot-laptop state = free np = 2 ntype = cluster jobs = 0/76.lancelot-laptop status = rectime=1343122703,varattr=,jobs=76.lancelot-laptop,state=free,netload=95261305,gres=,loadave=0.57,ncpus=2,physmem=1542608kb,availmem=2981784kb,totmem=3494344kb,idletime=14158,nusers=2,nsessions=13,sessions=1100 792 1309 1349 1365 1374 1384 1439 1452 1682 1749 1798 2737,uname=Linux lancelot-laptop 2.6.32-41-generic #94-Ubuntu SMP Fri Jul 6 16:51:39 UTC 2012 i686,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 cfa state = free np = 12 ntype = cluster jobs = 0/76.lancelot-laptop status = rectime=1343122703,varattr=,jobs=76.lancelot-laptop,state=free,netload=492745850,gres=,loadave=0.00,ncpus=12,physmem=8015456kb,availmem=22517440kb,totmem=24399448kb,idletime=2992,nusers=5,nsessions=58,sessions=18335 469 27670 752 18344 834 1171 1982 2226 3403 2290 14058 14160 14359 14579 15144 15464 15698 15913 16121 16201 16444 16988 17058 17603 18048 18278 18378 18379 18405 18411 18479 18557 18884 19096 22028 22149 22256 22257 22283 22290 22347 27347 27515 27561 30703 30712 30795 30797 30823 30829 30905 32454 32458 32459 32467 32469 32489,uname=Linux cfa 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 root at lancelot-laptop:/home/lancelot# tracejob 76 Job: 76.lancelot-laptop 
07/24/2012 15:01:03 M JOIN JOB as node 1 07/24/2012 15:01:03 S enqueuing into sai, state 1 hop 1 07/24/2012 15:01:03 S Job Queued at request of lancelot at cfa, owner = lancelot at cfa, job name = job3, queue = sai 07/24/2012 15:01:03 S Job Modified at request of Scheduler at lancelot-laptop 07/24/2012 15:01:03 L Job Run 07/24/2012 15:01:03 S Job Run at request of Scheduler at lancelot-laptop 07/24/2012 15:01:03 A queue=sai 07/24/2012 15:01:03 A user=lancelot group=lancelot jobname=job3 queue=sai ctime=1343113263 qtime=1343113263 etime=1343113263 start=1343113263 owner=lancelot at cfa exec_host=cfa/0+lancelot-laptop/0 Resource_List.cput=01:00:00 Resource_List.neednodes=2 Resource_List.nodect=2 Resource_List.nodes=2 Resource_List.walltime=24:00:00 07/24/2012 15:01:57 S Not sending email: User does not want mail of this type. root at lancelot-laptop:/home/lancelot# tracejob 77 Job: 77.lancelot-laptop 07/24/2012 15:13:11 S enqueuing into sai, state 1 hop 1 07/24/2012 15:13:11 S Job Queued at request of lancelot at cfa, owner = lancelot at cfa, job name = job4, queue = sai 07/24/2012 15:13:11 S Job Modified at request of Scheduler at lancelot-laptop 07/24/2012 15:13:11 L Job Run 07/24/2012 15:13:11 S Job Run at request of Scheduler at lancelot-laptop 07/24/2012 15:13:11 S Not sending email: User does not want mail of this type. 07/24/2012 15:13:11 A queue=sai 07/24/2012 15:13:11 A user=lancelot group=lancelot jobname=job4 queue=sai ctime=1343113991 qtime=1343113991 etime=1343113991 start=1343113991 owner=lancelot at cfa exec_host=lancelot-laptop/1 Resource_List.cput=01:00:00 Resource_List.walltime=24:00:00 07/24/2012 15:13:56 S Not sending email: User does not want mail of this type. 
07/24/2012 15:13:56 S Exit_status=0 resources_used.cput=00:00:45 resources_used.mem=5300kb resources_used.vmem=19680kb resources_used.walltime=00:00:45 07/24/2012 15:13:56 M scan_for_terminated: job 77.lancelot-laptop task 1 terminated, sid=4008 07/24/2012 15:13:56 M job was terminated 07/24/2012 15:13:56 M obit sent to server 07/24/2012 15:13:56 A user=lancelot group=lancelot jobname=job4 queue=sai ctime=1343113991 qtime=1343113991 etime=1343113991 start=1343113991 owner=lancelot at cfa exec_host=lancelot-laptop/1 Resource_List.cput=01:00:00 Resource_List.walltime=24:00:00 session=4008 end=1343114036 Exit_status=0 resources_used.cput=00:00:45 resources_used.mem=5300kb resources_used.vmem=19680kb resources_used.walltime=00:00:45 07/24/2012 15:13:57 M removed job script 07/24/2012 15:18:57 S dequeuing from sai, state COMPLETE root at lancelot-laptop:/home/lancelot# tracejob 78 /var/spool/torque/mom_logs/20120724: No matching job records located Job: 78.lancelot-laptop 07/24/2012 16:25:51 S enqueuing into sai, state 1 hop 1 07/24/2012 16:25:51 S Job Queued at request of lancelot at cfa, owner = lancelot at cfa, job name = job3, queue = sai 07/24/2012 16:25:51 A queue=sai 07/24/2012 16:25:56 S Job Modified at request of Scheduler at lancelot-laptop 07/24/2012 16:25:56 L Not enough of the right type of nodes available root at lancelot-laptop:/home/lancelot# qstat -f 76 Job Id: 76.lancelot-laptop Job_Name = job3 Job_Owner = lancelot at cfa resources_used.cput = 00:00:00 resources_used.mem = 9304kb resources_used.vmem = 478176kb resources_used.walltime = 01:15:51 job_state = R queue = sai server = lancelot-laptop Checkpoint = u ctime = Tue Jul 24 15:01:03 2012 Error_Path = cfa:/home/lancelot/job3.err exec_host = cfa/0+lancelot-laptop/0 exec_port = 15003+15003 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = a mtime = Tue Jul 24 15:01:57 2012 Output_Path = cfa:/home/lancelot/job3.log Priority = 0 qtime = Tue Jul 24 15:01:03 2012 Rerunable = True 
Resource_List.cput = 01:00:00 Resource_List.neednodes = 2 Resource_List.nodect = 2 Resource_List.nodes = 2 Resource_List.walltime = 24:00:00 session_id = 752 substate = 42 Variable_List = PBS_O_QUEUE=sai,PBS_O_HOST=cfa,PBS_O_HOME=/home/lancelot, PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=lancelot, PBS_O_PATH=/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/ sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/torque/bin:/usr/local/t orque/sbin:/usr/local/maui/bin:/usr/local/maui/sbin:/usr/java/jdk1.6.0 _33/bin:/home/shu/software/mpich2-1.4:/root/bin:/usr/local/torque/bin: /usr/local/maui/bin:/usr/java/jdk1.6.0_33/bin:/home/shu/software/mpich 2-1.4,PBS_O_MAIL=/var/spool/mail/lancelot,PBS_O_SHELL=/bin/bash, PBS_SERVER=lancelot-laptop,PBS_O_WORKDIR=/home/lancelot, TOMCAT_HOME=/home/shu/software/apache-tomcat-7.0.29,HOSTNAME=cfa, SHELL=/bin/bash,TERM=xterm,HISTSIZE=1000, SSH_CLIENT=192.168.0.46 58198 22,QTDIR=/usr/lib64/qt-3.3, QTINC=/usr/lib64/qt-3.3/include,SSH_TTY=/dev/pts/6,USER=lancelot, LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd= 40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=3 0;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj =01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*. 
zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01 ;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=0 1;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpi o=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.b mp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:* .xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01; 35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v= 01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.v ob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.r mvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:* .dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35: *.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36 :*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=0 1;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx= 01;36:*.xspf=01;36:, PATH=/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/ usr/bin:/sbin:/bin:/usr/games:/usr/local/torque/bin:/usr/local/torque/ sbin:/usr/local/maui/bin:/usr/local/maui/sbin:/usr/java/jdk1.6.0_33/bi n:/home/shu/software/mpich2-1.4:/root/bin:/usr/local/torque/bin:/usr/l ocal/maui/bin:/usr/java/jdk1.6.0_33/bin:/home/shu/software/mpich2-1.4, MAIL=/var/spool/mail/lancelot,PWD=/home/lancelot, JAVA_HOME=/usr/java/jdk1.6.0_33,LANG=en_US.UTF-8, HISTCONTROL=ignoredups, SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass, HOME=/home/lancelot,SHLVL=6,LOGNAME=lancelot,CVS_RSH=ssh, QTLIB=/usr/lib64/qt-3.3/lib, SSH_CONNECTION=192.168.0.46 58198 192.168.0.111 22, CLASSPATH=.:/usr/java/jdk1.6.0_33/jre/lib/rt.jar:/usr/java/jdk1.6.0_3 3/lib/dt.jar:/usr/java/jdk1.6.0_33/lib/tools.jar, LESSOPEN=|/usr/bin/lesspipe.sh %s,TORQUE=/usr/local/torque, MAUI=/usr/local/maui,G_BROKEN_FILENAMES=1,_=/usr/local/bin/qsub euser = lancelot egroup = lancelot hashname = 76.lancelot-laptop 
queue_rank = 19 queue_type = E comment = Job started on Tue Jul 24 at 15:01 etime = Tue Jul 24 15:01:03 2012 submit_args = job3.pbs start_time = Tue Jul 24 15:01:03 2012 Walltime.Remaining = 76692 start_count = 1 fault_tolerant = False submit_host = cfa init_work_dir = /home/lancelot root at lancelot-laptop:/home/lancelot# qstat Job id Name User Time Use S Queue ------------------------- ---------------- --------------- -------- - ----- 76.lancelot-laptop job3 lancelot 00:00:00 R sai 78.lancelot-laptop job3 lancelot 0 Q sai root at lancelot-laptop:/home/lancelot# qsub --version version: 3.0.3 root at lancelot-laptop:/home/lancelot# qmgr -c 'p s' # # Create queues and set their attributes. # # # Create and define queue sai # create queue sai set queue sai queue_type = Execution set queue sai acl_groups = lancelot-laptop set queue sai acl_group_sloppy = True set queue sai route_destinations = lancelot-laptop set queue sai enabled = True set queue sai started = True # # Set server attributes. 
# set server scheduling = True set server acl_hosts = lancelot-laptop set server managers = lancelot at lancelot-laptop set server operators = lancelot at lancelot-laptop set server default_queue = sai set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server mom_job_sync = True set server keep_completed = 300 set server next_job_number = 79 root at lancelot-laptop:/home/lancelot# qmgr -c "list queue sai" Queue sai queue_type = Execution total_jobs = 2 state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:0 acl_groups = lancelot-laptop acl_group_sloppy = True mtime = Tue Jul 24 11:31:37 2012 resources_assigned.nodect = 2 route_destinations = lancelot-laptop enabled = True started = True root at lancelot-laptop:/home/lancelot# cat /var/spool/torque/server_priv/nodes lancelot-laptop np=2 cfa np=12 tom np=2 root at lancelot-laptop:/home/lancelot# cat /var/spool/torque/mom_priv/config $pbsserver lancelot-laptop $logevent 255 cat server_name lancelot-laptop root at lancelot-laptop:/home/lancelot# qstat -Q Queue Max Tot Ena Str Que Run Hld Wat Trn Ext T ---------------- --- --- --- --- --- --- --- --- --- --- - sai 0 2 yes yes 1 1 0 0 0 0 E root at lancelot-laptop:/home/lancelot# qstat -q server: lancelot-laptop Queue Memory CPU Time Walltime Node Run Que Lm State ---------------- ------ -------- -------- ---- --- --- -- ----- sai -- -- -- -- 1 1 -- E R ----- ----- 1 1 root at lancelot-laptop:/home/lancelot# qstat -B Server Max Tot Que Run Hld Wat Trn Ext Status ---------------- --- --- --- --- --- --- --- --- ---------- lancelot-laptop 0 2 1 1 0 0 0 0 Active mom_logs: 07/24/2012 14:53:43;0002; pbs_mom;Svr;im_eof;End of File from addr 192.168.0.111:1023 07/24/2012 14:53:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 07/24/2012 14:54:44;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 07/24/2012 
14:54:44;0001; pbs_mom;Job;TMomFinalizeJob3;job 75.lancelot-laptop started, pid = 3807 07/24/2012 14:55:29;0080; pbs_mom;Job;75.lancelot-laptop;scan_for_terminated: job 75.lancelot-laptop task 1 terminated, sid=3807 07/24/2012 14:55:29;0008; pbs_mom;Job;75.lancelot-laptop;job was terminated 07/24/2012 14:55:29;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 07/24/2012 14:55:29;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 07/24/2012 14:55:29;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 07/24/2012 14:55:29;0080; pbs_mom;Job;75.lancelot-laptop;obit sent to server 07/24/2012 14:55:29;0080; pbs_mom;Job;75.lancelot-laptop;removed job script 07/24/2012 14:58:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 07/24/2012 15:01:03;0008; pbs_mom;Job;76.lancelot-laptop;JOIN JOB as node 1 07/24/2012 15:03:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 07/24/2012 15:08:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 07/24/2012 15:13:11;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 07/24/2012 15:13:11;0001; pbs_mom;Job;TMomFinalizeJob3;job 77.lancelot-laptop started, pid = 4008 07/24/2012 15:13:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 07/24/2012 15:13:56;0080; pbs_mom;Job;77.lancelot-laptop;scan_for_terminated: job 77.lancelot-laptop task 1 terminated, sid=4008 07/24/2012 15:13:56;0008; pbs_mom;Job;77.lancelot-laptop;job was terminated 07/24/2012 15:13:56;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 07/24/2012 15:13:56;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 07/24/2012 15:13:56;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 07/24/2012 15:13:56;0080; pbs_mom;Job;77.lancelot-laptop;obit sent to server 07/24/2012 15:13:57;0080; pbs_mom;Job;77.lancelot-laptop;removed job script 07/24/2012 
15:18:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 07/24/2012 15:23:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 07/24/2012 15:28:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 07/24/2012 15:33:50;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.3, loglevel = 0 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120726/92677c3d/attachment-0001.html From samuel at unimelb.edu.au Thu Jul 26 19:37:04 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 27 Jul 2012 11:37:04 +1000 Subject: [torqueusers] osc mpiexec and torque4 In-Reply-To: References: <20120725160613.GF5670@lbl.gov> <20120725170115.GJ5670@lbl.gov> <50102AAE.2010204@byu.edu> Message-ID: <5011F0C0.2090805@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hiya Brock, On 26/07/12 04:15, Brock Palen wrote: > I will look at filing a bug with adaptive about this behavior > change. In case nobody's mentioned it yet, this would be a good place to start: http://www.clusterresources.com/bugzilla/ If you've got Moab then once you've reported that you can open a support case with Adaptive for 4.1 and reference that bug. Best of luck! 
Chris (still on 2.4, I'm turning into a Garrick) - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAR8MAACgkQO2KABBYQAh/MDgCdEd7ppPrcqysFiFmx8Pe48TJU mxMAn1+o7tpdCtQp36UxaFCtU5GATpIh =kLbG -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Thu Jul 26 19:45:38 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 27 Jul 2012 11:45:38 +1000 Subject: [torqueusers] pbs_sched listening interface In-Reply-To: <50040433.6040104@brightcomputing.com> References: <50040433.6040104@brightcomputing.com> Message-ID: <5011F2C2.6090208@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 16/07/12 22:08, Taras Shapovalov wrote: > Is it possible to change interface, which pbs_sched is listening > to? It looks like pbs_sched does not use TRQ_IFNAME from > torque.cfg. Just had a look at the 4.1-fixes branch, it appears the code just gets the IP address for whatever gethostname() returns. So it doesn't look like it's possible as written, but it should be possible for a suitably motivated person to patch it to do so! 
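A minimal sketch for checking which address a host name maps to, since the thread notes that pbs_sched binds to whatever address the name from gethostname() resolves to. The helper name hosts_address is hypothetical, and it only reads a hosts-format file (default /etc/hosts). DNS and other name services are not consulted, so treat the result as a partial check.

```shell
#!/bin/sh
# Hypothetical helper: print the first address that a hosts-format file
# maps to the given name. DNS is not consulted.
# $1 = host name to look up
# $2 = hosts-format file (default: /etc/hosts)
hosts_address() {
    name=$1
    hosts_file=${2:-/etc/hosts}
    awk -v name="$name" '
        {
            sub(/#.*/, "")      # drop comments; fields are re-split
            # Field 1 is the address, the remaining fields are names.
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; found = 1; exit }
        }
        END { exit !found }     # non-zero status when nothing matched
    ' "$hosts_file"
}
```

Running hosts_address "$(hostname)" on the scheduler host shows the address listed for the local name in /etc/hosts, if there is one.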
cheers, Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAR8sIACgkQO2KABBYQAh8HYgCfShybjqPfitQmCiuQk/clPgJp dQQAn0cFEH2dTBbh1FuyJEWaVGnc/1m5 =UdsG -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Thu Jul 26 19:48:55 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 27 Jul 2012 11:48:55 +1000 Subject: [torqueusers] no C++ outout files In-Reply-To: References: Message-ID: <5011F387.50402@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 24/07/12 18:34, Yotam Avital wrote: > I'm running some computer simulation on a pbs machine (I'm not sure > if it's pbs or torque). The programs are supposed to > generate/overwrite a file every 1000 cycles, something that is > supposed to happen several times a day. However, no output file is > generated. This output file is essential for my work. How can I > generate the file? Sounds like something to report to your local sysadmins. My guess is that your code is writing to local disk on the compute nodes and so you'll need to copy the file back to the submit node at the end. *If* that is the case then you'll want to look at the qsub manual page and the stage_in and stage_out directives. If you're writing to a network filesystem and nothing is appearing then something is very wrong and only your local sysadmin can help you.. 
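A minimal sketch of the "copy the file back at the end" idea, assuming the job writes to node-local disk and that the submit directory is reachable from the compute node (for example over a shared filesystem). The helper name copy_back is hypothetical. PBS_O_WORKDIR is the submit directory that the batch system exports to the job. If the submit directory is not visible from the node, the staging options in the qsub manual page are the place to look instead.

```shell
#!/bin/sh
# Hypothetical helper for the end of a job script: copy a result file from
# node-local scratch back to the directory the job was submitted from.
# $1 = file to copy
# $2 = destination directory (default: $PBS_O_WORKDIR, else current dir)
copy_back() {
    src=$1
    dest_dir=${2:-${PBS_O_WORKDIR:-.}}
    if [ ! -f "$src" ]; then
        echo "copy_back: $src not found" >&2
        return 1
    fi
    # Create the destination if needed, then copy preserving timestamps.
    mkdir -p "$dest_dir" && cp -p "$src" "$dest_dir/"
}
```

In a job script this would be called as the last step, after the simulation exits, for example copy_back /tmp/output.dat.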
Best of luck, Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAR84cACgkQO2KABBYQAh+SYACeOZ16U/jdBAwCthG+PMtOBIXN nzcAn1FM1oywZ7ViTCuNN/dQpS45bnza =lUT9 -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Thu Jul 26 19:49:55 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 27 Jul 2012 11:49:55 +1000 Subject: [torqueusers] Torque 2.5.x->4.x upgrade finding already-running jobs? In-Reply-To: <50083FA0.6060408@byu.edu> References: <50083FA0.6060408@byu.edu> Message-ID: <5011F3C3.6050809@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 20/07/12 03:10, Lloyd Brown wrote: > I know that, due to communication-protocol changes, we need to > upgrade the pbs_server and pbs_mom's together. But is there any > known issue with the 4.x pbs_mom's picking up on the > already-running jobs (started by the 2.5.x pbs_mom)? Obviously the > pbs_mom process won't be the parent process of the job, but that > shouldn't be any different than restarting with the "-p" option > anyway. To be honest I wouldn't even try that, 4.x uses hwloc rather than its own cpusets code so I don't believe it's worth the risk. 
- -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAR88MACgkQO2KABBYQAh/R3ACfVPch5ZlnZP4EgPDnaG2mTF4+ kFEAn0VWbEOQzOds6v5afPGoe8HW+LZ5 =JOs4 -----END PGP SIGNATURE----- From jwilkinson at stoneeagle.com Fri Jul 27 09:18:23 2012 From: jwilkinson at stoneeagle.com (Jack Wilkinson) Date: Fri, 27 Jul 2012 15:18:23 +0000 Subject: [torqueusers] A couple of user based questions... Message-ID: <00051F5C670B8444B35CB2B31B9B1D090C498907@se-ex2.stoneeagle.com> Dear list members, I hope I'm not mis-asking some questions, but you seem to be pretty much my only resource at this point-in-time. 1) We have a small batch farm, 1 head box, 4 batch boxes. Assuming that the head box is running, is there a command that I can issue to find out which of the batch boxes are running? 2) I did not configure this setup. The only way I've found to be able to spread four jobs across the four boxes is to specifically request a node by name. For example: #PBS -l nodes=1:ppn=1 #PBS -l nodes=vpaybatch01 (and here, for each of the remaining files I do 02, 03, 04) Is there some way that I can just qsub four jobs and have them each go to the next available batch box? Thank you! Jack Wilkinson, Programmer Services | VPay(r) P: 972.367-6622 jwilkinson at stoneeagle.com www.stoneeagle.com www.vpayusa.com 111 W. Spring Valley Rd., #100 Richardson, TX 75081 CONFIDENTIALITY NOTICE: This email, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure, or distribution is prohibited. 
If you received this email and are not the intended recipient, please inform the sender by email reply and destroy all copies of the original message.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120727/b53a9c6d/attachment.html

From dbeer at adaptivecomputing.com Fri Jul 27 09:20:27 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 27 Jul 2012 09:20:27 -0600
Subject: [torqueusers] A couple of user based questions...
In-Reply-To: <00051F5C670B8444B35CB2B31B9B1D090C498907@se-ex2.stoneeagle.com>
References: <00051F5C670B8444B35CB2B31B9B1D090C498907@se-ex2.stoneeagle.com>
Message-ID:

pbsnodes should tell you the status of the nodes in the system. Check out the man page, there are many different options for this command to check on the status of your cluster.

David

On Fri, Jul 27, 2012 at 9:18 AM, Jack Wilkinson wrote:
> Dear list members, I hope I'm not mis-asking some questions, but you
> seem to be pretty much my only resource at this point-in-time.
>
> 1) We have a small batch farm, 1 head box, 4 batch boxes. Assuming
> that the head box is running, is there a command that I can issue to find
> out which of the batch boxes are running?
>
> 2) I did not configure this setup. The only way I've found to be
> able to spread four jobs across the four boxes is to specifically request a
> node by name. For example:
> #PBS -l nodes=1:ppn=1
> #PBS -l nodes=vpaybatch01 (and here, for each of the remaining files I
> do 02, 03, 04)
> Is there some way that I can just qsub four jobs and have them each go to
> the next available batch box?
>
> Thank you!
>
> *Jack Wilkinson,* Programmer
> Services | VPay(r)
> P: 972.367-6622
> jwilkinson at stoneeagle.com
> www.stoneeagle.com
> www.vpayusa.com
>
> 111 W.
Spring Valley Rd., #100 > > Richardson, TX 75081 > > > CONFIDENTIALITY NOTICE: This email, including any attachments, is for the > sole use of the intended recipient(s) and may contain confidential and > privileged information. Any unauthorized review, use, disclosure, or > distribution is prohibited. If you received this email and are not the > intended recipient, please inform the sender by email reply and destroy all > copies of the original message. > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120727/2f4c8763/attachment-0001.html From lloyd_brown at byu.edu Fri Jul 27 09:22:50 2012 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Fri, 27 Jul 2012 09:22:50 -0600 Subject: [torqueusers] Torque 2.5.x->4.x upgrade finding already-running jobs? In-Reply-To: <5011F3C3.6050809@unimelb.edu.au> References: <50083FA0.6060408@byu.edu> <5011F3C3.6050809@unimelb.edu.au> Message-ID: <5012B24A.8000407@byu.edu> Interesting. I didn't explicitly enable either one. I don't know about the hwloc stuff in 4.x, but don't I have to explicitly compile in the cpusets code in 2.x? I don't think I did, in this case. Would that have an effect? In any case, this is why I'm testing this on a staging cluster, not on our production system. If I can pull it off, then we can upgrade without needing to drain the whole cluster. If not, then we're no worse off than if I hadn't tried. Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 07/26/2012 07:49 PM, Christopher Samuel wrote: > To be honest I wouldn't even try that, 4.x uses hwloc rather than > its own cpusets code so I don't believe it's worth the risk. 
From dbeer at adaptivecomputing.com Fri Jul 27 09:26:38 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 27 Jul 2012 09:26:38 -0600
Subject: [torqueusers] Torque 2.5.x->4.x upgrade finding already-running jobs?
In-Reply-To: <5012B24A.8000407@byu.edu>
References: <50083FA0.6060408@byu.edu> <5011F3C3.6050809@unimelb.edu.au> <5012B24A.8000407@byu.edu>
Message-ID:

Lloyd,

If you don't use cpusets then the change to hwloc won't affect you at all. You have to configure cpusets on in order to use them. At any rate I'm interested to hear about the results of your test.

David

On Fri, Jul 27, 2012 at 9:22 AM, Lloyd Brown wrote:
> Interesting. I didn't explicitly enable either one. I don't know
> about the hwloc stuff in 4.x, but don't I have to explicitly compile
> in the cpusets code in 2.x? I don't think I did, in this case. Would
> that have an effect?
>
> In any case, this is why I'm testing this on a staging cluster, not on
> our production system. If I can pull it off, then we can upgrade
> without needing to drain the whole cluster. If not, then we're no
> worse off than if I hadn't tried.
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
> On 07/26/2012 07:49 PM, Christopher Samuel wrote:
> > To be honest I wouldn't even try that, 4.x uses hwloc rather than
> > its own cpusets code so I don't believe it's worth the risk.
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120727/9c43f68a/attachment.html

From lloyd_brown at byu.edu Fri Jul 27 09:29:02 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Fri, 27 Jul 2012 09:29:02 -0600
Subject: [torqueusers] Torque 2.5.x->4.x upgrade finding already-running jobs?
In-Reply-To:
References: <50083FA0.6060408@byu.edu> <5011F3C3.6050809@unimelb.edu.au> <5012B24A.8000407@byu.edu>
Message-ID: <5012B3BE.6070305@byu.edu>

Well, so far, I have the pbs_server successfully upgrading and finding the jobs, and the pbs_mom's upgrading, but for some reason the pbs_mom's are ignoring the running processes, and deleting the TORQUEHOME/mom_priv/jobs/*.{JB,SC,TK} files. They don't delete the jobs on the server, or kill the running PIDs, though, so it's a little confusing.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 07/27/2012 09:26 AM, David Beer wrote:
> Lloyd,
>
> If you don't use cpusets then the change to hwloc won't affect you at
> all. You have to configure cpusets on in order to use them. At any rate
> I'm interested to hear about the results of your test.
>
> David
>
> On Fri, Jul 27, 2012 at 9:22 AM, Lloyd Brown
> wrote:
>
> Interesting. I didn't explicitly enable either one. I don't know
> about the hwloc stuff in 4.x, but don't I have to explicitly compile
> in the cpusets code in 2.x? I don't think I did, in this case. Would
> that have an effect?
>
> In any case, this is why I'm testing this on a staging cluster, not on
> our production system. If I can pull it off, then we can upgrade
> without needing to drain the whole cluster. If not, then we're no
> worse off than if I hadn't tried.
> > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 07/26/2012 07:49 PM, Christopher Samuel wrote: > > To be honest I wouldn't even try that, 4.x uses hwloc rather than > > its own cpusets code so I don't believe it's worth the risk. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From laytonjb at att.net Sat Jul 28 09:03:37 2012 From: laytonjb at att.net (Jeff Layton) Date: Sat, 28 Jul 2012 11:03:37 -0400 Subject: [torqueusers] Hostname problems on compute node with pbs_mom? Message-ID: <5013FF49.1080407@att.net> Good morning, I'm running Torque 4.0.2 on a cluster and have installed torque and torque-client on the compute node. When I try to start pbs_mom, I get the following error in the mom_logs: 07/28/2012 11:24:20;0002; pbs_mom;Svr;Log;Log opened 07/28/2012 11:24:20;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0 07/28/2012 11:24:20;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::pbs_mom, Unable to get my full hostname for n0001 error -1 The name of the compute node is n0001 (no domain or anything). TIA! Jeff From andre.gemuend at scai.fraunhofer.de Mon Jul 30 01:39:58 2012 From: andre.gemuend at scai.fraunhofer.de (=?utf-8?Q?Andr=C3=A9_Gem=C3=BCnd?=) Date: Mon, 30 Jul 2012 09:39:58 +0200 (CEST) Subject: [torqueusers] A couple of user based questions... In-Reply-To: <00051F5C670B8444B35CB2B31B9B1D090C498907@se-ex2.stoneeagle.com> Message-ID: <2051367844.3061221.1343633998554.JavaMail.root@scai.fraunhofer.de> > 2) I did not configure this setup. 
The only way I've found to be able > to spread four jobs across the four boxes is to specifically request > a node by name. For example: > #PBS -l nodes=1:ppn=1 > #PBS -l nodes=vpaybatch01 (and here, for each of the remaining files > I do 02, 03, 04) > Is there some way that I can just qsub four jobs and have them each > go to the next available batch box? That depends on the scheduler setup. You have to check if you are using maui/moab, or pbs_sched (the builtin scheduler). For Maui/Moab, have a look at NODEALLOCATIONPOLICY http://www.adaptivecomputing.com/resources/docs/maui/5.2nodeallocation.php Greetings André -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From taras.shapovalov at brightcomputing.com Mon Jul 30 03:38:07 2012 From: taras.shapovalov at brightcomputing.com (Taras Shapovalov) Date: Mon, 30 Jul 2012 11:38:07 +0200 Subject: [torqueusers] pbs_sched listening interface In-Reply-To: <5011F2C2.6090208@unimelb.edu.au> References: <50040433.6040104@brightcomputing.com> <5011F2C2.6090208@unimelb.edu.au> Message-ID: <501655FF.4060403@brightcomputing.com> Thank you Chris, Probably, we will patch it. -- Taras On 07/27/2012 03:45 AM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 16/07/12 22:08, Taras Shapovalov wrote: > >> Is it possible to change interface, which pbs_sched is listening >> to? It looks like pbs_sched does not use TRQ_IFNAME from >> torque.cfg. > Just had a look at the 4.1-fixes branch, it appears the code > just gets the IP address for whatever gethostname() returns. > > So it doesn't look like it's possible as written, but it > should be possible for a suitably motivated person to patch > it to do so! 
> > cheers, > Chris > - -- > Christopher Samuel Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAlAR8sIACgkQO2KABBYQAh8HYgCfShybjqPfitQmCiuQk/clPgJp > dQQAn0cFEH2dTBbh1FuyJEWaVGnc/1m5 > =UdsG > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jwilkinson at stoneeagle.com Mon Jul 30 08:40:48 2012 From: jwilkinson at stoneeagle.com (Jack Wilkinson) Date: Mon, 30 Jul 2012 14:40:48 +0000 Subject: [torqueusers] A couple of user based questions... In-Reply-To: <2051367844.3061221.1343633998554.JavaMail.root@scai.fraunhofer.de> References: <00051F5C670B8444B35CB2B31B9B1D090C498907@se-ex2.stoneeagle.com> <2051367844.3061221.1343633998554.JavaMail.root@scai.fraunhofer.de> Message-ID: <00051F5C670B8444B35CB2B31B9B1D090C498A27@se-ex2.stoneeagle.com> Looks like we're running the built in pbs scheduler. -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of André Gemünd Sent: Monday, July 30, 2012 2:40 AM To: Torque Users Mailing List Subject: Re: [torqueusers] A couple of user based questions... > 2) I did not configure this setup. The only way I've found to be able > to spread four jobs across the four boxes is to specifically request a > node by name. For example: > #PBS -l nodes=1:ppn=1 > #PBS -l nodes=vpaybatch01 (and here, for each of the remaining files I > do 02, 03, 04) Is there some way that I can just qsub four jobs and > have them each go to the next available batch box? That depends on the scheduler setup. 
You have to check if you are using maui/moab, or pbs_sched (the builtin scheduler). For Maui/Moab, have a look at NODEALLOCATIONPOLICY http://www.adaptivecomputing.com/resources/docs/maui/5.2nodeallocation.php Greetings André -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From andre.gemuend at scai.fraunhofer.de Mon Jul 30 08:53:19 2012 From: andre.gemuend at scai.fraunhofer.de (=?utf-8?Q?Andr=C3=A9_Gem=C3=BCnd?=) Date: Mon, 30 Jul 2012 16:53:19 +0200 (CEST) Subject: [torqueusers] A couple of user based questions... In-Reply-To: <00051F5C670B8444B35CB2B31B9B1D090C498A27@se-ex2.stoneeagle.com> Message-ID: <1892439944.3103057.1343659999135.JavaMail.root@scai.fraunhofer.de> I assume your boxes have multiple cores? So Torque just packs as many jobs on the nodes as cores are available. You can set server node_pack false to scatter jobs across all nodes, even if the first node has free cores. hth Andre ----- Original Message ----- > Looks like we're running the built in pbs scheduler. > > -----Original Message----- > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of André > Gemünd > Sent: Monday, July 30, 2012 2:40 AM > To: Torque Users Mailing List > Subject: Re: [torqueusers] A couple of user based questions... > > > 2) I did not configure this setup. The only way I've found to be > > able > > to spread four jobs across the four boxes is to specifically > > request a > > node by name. 
For example: > > #PBS -l nodes=1:ppn=1 > > #PBS -l nodes=vpaybatch01 (and here, for each of the remaining > > files I > > do 02, 03, 04) Is there some way that I can just qsub four jobs and > > have them each go to the next available batch box? > > That depends on the scheduler setup. You have to check if you are > using maui/moab, or pbs_sched (the builtin scheduler). > For Maui/Moab, have a look at NODEALLOCATIONPOLICY > http://www.adaptivecomputing.com/resources/docs/maui/5.2nodeallocation.php > > Greetings > André > > -- > André Gemünd > Fraunhofer-Institute for Algorithms and Scientific Computing > andre.gemuend at scai.fraunhofer.de > Tel: +49 2241 14-2193 > /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From tran.v.allan at gmail.com Mon Jul 30 09:37:03 2012 From: tran.v.allan at gmail.com (Allan Tran) Date: Mon, 30 Jul 2012 09:37:03 -0600 Subject: [torqueusers] Torque/Maui queues priority questions Message-ID: Hi everyone, I'm taking a new role of managing a small cluster. I worked with torque/maui before but just a single batch queue. 
I need some advice/guidance for the following configuration: Current configs: Large queue: 20 nodes (unlimited hours) normal queue: 4 nodes (2 hrs) A route queue that sends jobs to the appropriate large/small queue based on the resource request, default is Now a new department has purchased their own compute nodes (8 nodes) and would like to have 2 queues A (max 4 hours) and B (unlimited walltime) They will share these nodes with the rest of the cluster users but when they need to run jobs, they want to have the first priority to use the new 8 nodes (queue A and/or B) What I'm thinking now is to add a new group for this department and only allow people in the group to run in A or B but I think that will prevent everyone else from using these nodes when the nodes are idle. Can anyone provide me a better solution? Thank you much. Allan. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120730/c436da7b/attachment.html From bdandrus at nps.edu Mon Jul 30 10:23:15 2012 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Mon, 30 Jul 2012 16:23:15 +0000 Subject: [torqueusers] Job not running due to features when node name specified Message-ID: I am a bit confused as to how to troubleshoot and understand why this is. Running torque 2.5.12 and moab 6.1.6 I submit a job with a specific node: qsub -l nodes=compute-3-1 It queues up fine, but never runs. qshow shows it as eligible and queued. When I run checkjob -v on it, it shows: compute-3-1 rejected: Features ??? Um ok... qstat shows: Resource_List.nodect = 1 Resource_List.nodes = compute-3-1 Resource_List.pmem = 1gb But I can force it with 'qrun 839' and it does run on the node requested. One thing I would REALLY like to know is how to determine the specific features a job is being rejected for on a particular node. And why would a job be rejected due to features when it clearly can and should run? FWIW, compute-3-1 has no other jobs on it at the time. 
Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 From tran.v.allan at gmail.com Mon Jul 30 15:46:54 2012 From: tran.v.allan at gmail.com (Allan Tran) Date: Mon, 30 Jul 2012 15:46:54 -0600 Subject: [torqueusers] Test email list Message-ID: Testing. I sent one today but somehow didn't make to the list. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120730/ea11041c/attachment.html From roman.ricardo at gmail.com Mon Jul 30 15:50:30 2012 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Mon, 30 Jul 2012 15:50:30 -0600 Subject: [torqueusers] Test email list In-Reply-To: References: Message-ID: got it On Mon, Jul 30, 2012 at 3:46 PM, Allan Tran wrote: > Testing. I sent one today but somehow didn't make to the list. > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120730/ea7c4489/attachment.html From tran.v.allan at gmail.com Mon Jul 30 15:54:59 2012 From: tran.v.allan at gmail.com (Allan Tran) Date: Mon, 30 Jul 2012 15:54:59 -0600 Subject: [torqueusers] Torque/maui priority questions. Message-ID: Sorry if this is a duplicate email. 
I need some advice/guidance for the following configuration: Large queue: 20 nodes (unlimited hours) normal queue: 4 nodes (2 hrs) And a route queue that sends jobs to the appropriate large/small queue based on the resource request Now a new department has purchased their own compute nodes (8 nodes) and would like to have 2 queues A (max 4 hours) and B (unlimited walltime) They will share these nodes with the rest of the cluster users but whenever they run jobs, they want to have the first priority to use the new 8 nodes (queue A and/or B) What I'm thinking now is to add a new group for this department and only allow people in the group to run in A or B but I think that will prevent everyone else from using these nodes when the nodes are idle. Can anyone provide me a good solution? Thank you much. I'm running torque 2.3.6 and maui 3.2.6p21 Allan -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120730/624ebeff/attachment-0001.html From dbeer at adaptivecomputing.com Mon Jul 30 17:26:10 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 30 Jul 2012 17:26:10 -0600 Subject: [torqueusers] Changing --with-debug to the default Message-ID: All, Our support team has asked us repeatedly about making --with-debug the default for TORQUE, making it so it can be disabled using --without-debug, but obviously being on by default. Our support team wants this so that when a site reports a core dump, the core has information in it and the admin doesn't have to recompile and try to reproduce in order to get enough information to debug the crash. What are your thoughts on making this change? -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120730/88053a52/attachment.html From samuel at unimelb.edu.au Mon Jul 30 22:52:09 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 31 Jul 2012 14:52:09 +1000 Subject: [torqueusers] pbs_sched listening interface In-Reply-To: <501655FF.4060403@brightcomputing.com> References: <50040433.6040104@brightcomputing.com> <5011F2C2.6090208@unimelb.edu.au> <501655FF.4060403@brightcomputing.com> Message-ID: <50176479.8000401@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 30/07/12 19:38, Taras Shapovalov wrote: > Thank you Chris, Probably, we will patch it. Best of luck - if you get it working I'd encourage you to submit it as a patch to go into the mainline via the Torque bugzilla: http://www.clusterresources.com/bugzilla/ cheers, Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAXZHkACgkQO2KABBYQAh/d/wCgip6TNOx36/17x2TIZSsG3tXG xQ8AniJhF4B/y2Wlj/pSvXb4FU047Hdv =hMum -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Mon Jul 30 22:57:37 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 31 Jul 2012 14:57:37 +1000 Subject: [torqueusers] Hostname problems on compute node with pbs_mom? 
In-Reply-To: <5013FF49.1080407@att.net> References: <5013FF49.1080407@att.net> Message-ID: <501765C1.3040002@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 29/07/12 01:03, Jeff Layton wrote: > 07/28/2012 11:24:20;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::pbs_mom, > Unable to get my full hostname for n0001 error -1 There's a bunch of reasons that can fail looking at the code, but they only get reported if the optional argument EMsg is passed through to get_fullhostname() which isn't the case for the call site that leads to that error.. :-( You might need to instrument the code to find out where it's failing.. I'd also suggest reporting a bug about it not logging those errors by default. http://www.clusterresources.com/bugzilla/ cheers, Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAXZcEACgkQO2KABBYQAh9JpACcDXdCkO3///MDQbl5tz6Yy6em 2QMAn1EZMvsTyJZfO28yL0VCcJ0k5xbU =eYK5 -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Tue Jul 31 00:34:00 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 31 Jul 2012 16:34:00 +1000 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: References: Message-ID: <50177C58.5060104@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 31/07/12 09:26, David Beer wrote: > Our support team has asked us repeatedly about making --with-debug > the default for TORQUE, making it so it can be disabled using > --without-debug, but obviously being on by default. 
Our support > team wants this so that when a site reports a core dump, the core > has information in it and the admin doesn't have to recompile and > try to reproduce in order to get enough information to debug the > crash. What are your thoughts on making this change? Never tried it I'm afraid - does it just enable the "DEBUG" preprocessor symbol? If so then that might have some unexpected effects, a quick grep shows that will result in extra output in pbsdsh and could potentially conflict in the DRMAA code (which has its own "DEBUG" symbol). I can't check as the 4.1-fixes branch currently won't build on Ubuntu, the check for the libxml2 dev package fails due to an unrelated problem. cheers, Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAXfFgACgkQO2KABBYQAh8gEACfUacx36mL/46Mh1IP9T+87x2V LxMAn3sq7LYQzxpEVZ4+wV8awGJH//4w =USRg -----END PGP SIGNATURE----- From l.flis at cyf-kr.edu.pl Tue Jul 31 01:11:08 2012 From: l.flis at cyf-kr.edu.pl (Lukasz Flis) Date: Tue, 31 Jul 2012 09:11:08 +0200 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: <50177C58.5060104@unimelb.edu.au> References: <50177C58.5060104@unimelb.edu.au> Message-ID: <5017850C.8010409@cyf-kr.edu.pl> Hi Chris, Hi All, We have been using our custom-built torque RPMs compiled with '--with-debug' for 1.5 years now. We run 2.5.12 on a 12k+ core cluster with 208 GPGPU cards. I haven't noticed any overhead or the problems which Chris describes (no extra output for pbsdsh). Thanks to the flag we were able to spot and report a few bugs in the torque server. As for me, it is perfectly safe to use and I would recommend turning it on for bigger installations. 
Cheers, -- Lukasz Flis On 07/31/2012 08:34 AM, Christopher Samuel wrote: > On 31/07/12 09:26, David Beer wrote: > >> Our support team has asked us repeatedly about making >> --with-debug the default for TORQUE, making it so it can be >> disabled using --without-debug, but obviously being on by >> default. Our support team wants this so that when a site reports >> a core dump, the core has information in it and the admin doesn't >> have to recompile and try to reproduce in order to get enough >> information to debug the crash. What are your thoughts on making >> this change? > > Never tried it I'm afraid - does it just enable the "DEBUG" > preprocessor symbol? > > If so then that might have some unexpected effects, a quick grep > shows that will result in extra output in pbsdsh and could > potentially conflict in the DRMAA code (which has its own "DEBUG" > symbol). > > I can't check as the 4.1-fixes branch currently won't build on > Ubuntu, the check for the libxml2 dev package fails due to an > unrelated problem. > > cheers, Chris _______________________________________________ > torqueusers mailing list torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From andre.gemuend at scai.fraunhofer.de Tue Jul 31 01:26:02 2012 From: andre.gemuend at scai.fraunhofer.de (=?utf-8?Q?Andr=C3=A9_Gem=C3=BCnd?=) Date: Tue, 31 Jul 2012 09:26:02 +0200 (CEST) Subject: [torqueusers] Torque/maui priority questions. In-Reply-To: Message-ID: <593989726.3205397.1343719562787.JavaMail.root@scai.fraunhofer.de> I'd suggest to use a combination of standing reservation and preemption in this case. The new department is the owner of a standing reservation on the new nodes with FLAGS=OWNERPREEMPT, setting the nodes in the hostlist. These jobs will be preemptor for the jobs of the other departments on these resources. Depending on the type of jobs you use, you can choose suspension, requeuing or checkpointing. 
Suspension naturally has the problem that the memory will still be allocated by the running process (it's just SIGSTOP, as if you'd press ctrl+z on the tty). hth Andre ----- Original Message ----- > > Sorry if this is a duplicate email. > > I need some advice/guidance for following configurations: > Large queue: 20 nodes (unlimited hours) > normal queue: 4 nodes (2 hrs) > And a Route queue to sends job to appropriate large/small based on > resource request > > Now a new department purchased their own compute nodes (8 nodes) and > like to have 2 queues A (max 4 hours) and B (unlimited walltime) > They will share these nodes with the rest of the cluster users but > whenever they run jobs, they want to have the first priority to use > new 8 nodes (queue A and/or B) > > What I'm thinking now is add a new group for this department and only > allow people in the group to run in A or B but I think that will > prevent everyone else from using these nodes when the nodes are > idle. > Can anyone provide me a good solution? Thank you much. > I'm running torque 2.3.6 and maui 3.2.6p21 > > Allan > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From Gareth.Williams at csiro.au Tue Jul 31 04:35:31 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Tue, 31 Jul 2012 20:35:31 +1000 Subject: [torqueusers] qpeek in contrib Message-ID: <007DECE986B47F4EABF823C1FBB19C62010529679FFE@exvic-mbx04.nexus.csiro.au> Hi, We've found a need for a command to look at a job's output before it is finished, so we tried the qpeek script from OSC in the torque contrib directory. 
With a few minor mods it works fine but I want to check a few things (different people may be needed to answer each question): 1) That qpeek is marked: Copyright 2001, Ohio Supercomputer Center but with no author or licensing info. Since it is clearly in a contrib(uted) section of an open source project I'm assuming that modifying it for local use is fair game - though I'd be cautious about redistributing changes. Does anyone know better? 2) If I improve that qpeek, who might I contribute the changes back to? 3) Does anyone have an alternative to share? I see nf.nci.org.au has a qcat which serves a similar purpose with a mixture of more and less functionality (it will also show the script but does not head, tail or follow). Frankly I'd prefer not to write my own! Thanks, Gareth From knielson at adaptivecomputing.com Tue Jul 31 07:55:08 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 31 Jul 2012 07:55:08 -0600 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: <50177C58.5060104@unimelb.edu.au> References: <50177C58.5060104@unimelb.edu.au> Message-ID: On Tue, Jul 31, 2012 at 12:34 AM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 31/07/12 09:26, David Beer wrote: > > > Our support team has asked us repeatedly about making --with-debug > > the default for TORQUE, making it so it can be disabled using > > --without-debug, but obviously being on by default. Our support > > team wants this so that when a site reports a core dump, the core > > has information in it and the admin doesn't have to recompile and > > try to reproduce in order to get enough information to debug the > > crash. What are your thoughts on making this change? > > Never tried it I'm afraid - does it just enable the "DEBUG" > preprocessor symbol? 
> > If so then that might have some unexpected effects, a quick grep shows > that will result in extra output in pbsdsh and could potentially > conflict in the DRMAA code (which has its own "DEBUG" symbol). > Chris, --enable-debug will activate the DEBUG macros. --with-debug simply compiles TORQUE using the -g option in CFLAGS. It does not create any output. Ken > > I can't check as the 4.1-fixes branch currently won't build on Ubuntu, > the check for the libxml2 dev package fails due to an unrelated problem. > > cheers, > Chris > - -- > Christopher Samuel Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAlAXfFgACgkQO2KABBYQAh8gEACfUacx36mL/46Mh1IP9T+87x2V > LxMAn3sq7LYQzxpEVZ4+wV8awGJH//4w > =USRg > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120731/24ee8daf/attachment.html From dbeer at adaptivecomputing.com Tue Jul 31 09:49:52 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 31 Jul 2012 09:49:52 -0600 Subject: [torqueusers] Hostname problems on compute node with pbs_mom? In-Reply-To: <501765C1.3040002@unimelb.edu.au> References: <5013FF49.1080407@att.net> <501765C1.3040002@unimelb.edu.au> Message-ID: What this error message almost certainly means is that the call to either getnameinfo() or getaddrinfo() failed. Is there any reason to believe that is possible? 
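One way to test that hypothesis on the compute node itself is to run the same forward and reverse lookups outside of pbs_mom. The following is a minimal sketch, assuming Python is available on the node; `check_hostname` is a hypothetical helper written for illustration, not part of TORQUE, and requiring a real name from the reverse lookup (`NI_NAMEREQD`) is an assumption about what a useful check looks like, not a statement about which flags pbs_mom passes.

```python
import socket


def check_hostname(name, flags=socket.NI_NAMEREQD):
    """Return (ok, detail) for a forward and a reverse lookup of `name`.

    Hypothetical helper for illustration only; it is not TORQUE code.
    """
    try:
        # Forward lookup: hostname -> list of socket addresses.
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror as exc:
        return False, "getaddrinfo failed: %s" % exc

    # Take the socket address of the first result.
    sockaddr = infos[0][4]
    try:
        # Reverse lookup: socket address -> hostname.
        host, _service = socket.getnameinfo(sockaddr, flags)
    except socket.gaierror as exc:
        return False, "getnameinfo failed: %s" % exc

    return True, host
```

Calling `check_hostname("n0001")` on the node reports which of the two lookups fails, if either does. If one fails, the node's /etc/hosts entry or DNS setup is a reasonable place to look next.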
David On Mon, Jul 30, 2012 at 10:57 PM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 29/07/12 01:03, Jeff Layton wrote: > > > 07/28/2012 11:24:20;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::pbs_mom, > > Unable to get my full hostname for n0001 error -1 > > There's a bunch of reasons that can fail looking at the code, but they > only get reported if the optional argument EMsg is passed through to > get_fullhostname() which isn't the case for the call site that leads > to that error.. :-( > > You might need to instrument the code to find out where it's failing.. > > I'd also suggest reporting a bug about it not logging those errors by > default. http://www.clusterresources.com/bugzilla/ > > cheers, > Chris > - -- > Christopher Samuel Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAlAXZcEACgkQO2KABBYQAh9JpACcDXdCkO3///MDQbl5tz6Yy6em > 2QMAn1EZMvsTyJZfO28yL0VCcJ0k5xbU > =eYK5 > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120731/64a092f8/attachment.html From craig.tierney at noaa.gov Tue Jul 31 12:39:11 2012 From: craig.tierney at noaa.gov (Craig Tierney) Date: Tue, 31 Jul 2012 12:39:11 -0600 Subject: [torqueusers] qpeek in contrib In-Reply-To: <007DECE986B47F4EABF823C1FBB19C62010529679FFE@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C62010529679FFE@exvic-mbx04.nexus.csiro.au> Message-ID: <5018264F.20500@noaa.gov> On 7/31/12 4:35 AM, Gareth.Williams at csiro.au wrote: > Hi, > > We've found a need for a command to look at jobs output before it is finished so tried the qpeek script from OSC in the torque contrib directory. With a few > minor mods it works fine but I want to check a few things (different people may be needed to answer each question): > > 1) That qpeek is marked: Copyright 2001, Ohio Supercomputer Center but with no author or licensing info. Since it is clearly in a contrib(uted) section of an > open source project I'm assuming that modifying it for local use is fair game - though I'd be cautious about redistributing changes. Does anyone know > better? > > 2) If I improve that qpeek, who might I contribute the changes back to? > > 3) Does anyone have an alternative to share? I see nf.nci.org.au has a qcat which serves a similar purpose with a mixture of more and less functionality (it > will also show the script but does not head, tail or follow). > > Frankly I'd prefer not to write my own! > Gareth, Do you have a shared filesystem between your systems? If so, why not enable "$spool_as_final_name true" in mom_priv/config. 
http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/a.cmomconfig.php Craig > Thanks, > > Gareth > > _______________________________________________ torqueusers mailing list torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From craig.tierney at noaa.gov Tue Jul 31 12:41:18 2012 From: craig.tierney at noaa.gov (Craig Tierney) Date: Tue, 31 Jul 2012 12:41:18 -0600 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: References: Message-ID: <501826CE.8020109@noaa.gov> On 7/30/12 5:26 PM, David Beer wrote: > All, > > Our support team has asked us repeatedly about making --with-debug the default for TORQUE, making it so it can be disabled using --without-debug, but > obviously being on by default. Our support team wants this so that when a site reports a core dump, the core has information in it and the admin doesn't have > to recompile and try to reproduce in order to get enough information to debug the crash. What are your thoughts on making this change? > > -- David Beer | Software Engineer Adaptive Computing > > David, If all this does is compile with -g, it sounds like a good idea. I am building Torque with this now (350 nodes now, will be 2500 by EOY), and if I see any issues I will report it. My concern is performance. Is there any reason this would slow the server down? 
Craig > > _______________________________________________ torqueusers mailing list torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From tbaer at utk.edu Tue Jul 31 12:48:43 2012 From: tbaer at utk.edu (Troy Baer) Date: Tue, 31 Jul 2012 14:48:43 -0400 Subject: [torqueusers] qpeek in contrib In-Reply-To: <007DECE986B47F4EABF823C1FBB19C62010529679FFE@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C62010529679FFE@exvic-mbx04.nexus.csiro.au> Message-ID: <1343760523.20484.421.camel@browncoat.jics.utk.edu> On Tue, 2012-07-31 at 20:35 +1000, Gareth.Williams at csiro.au wrote: > We've found a need for a command to look at jobs output before it is > finished so tried the qpeek script from OSC in the torque contrib > directory. With a few minor mods it works fine but I want to check a > few things (different people may be needed to answer each question): > > 1) That qpeek is marked: Copyright 2001, Ohio Supercomputer Center > but with no author or licensing info. Since it is clearly in a > contrib(uted) section of an open source project I'm assuming that > modifying it for local use is fair game - though I'd be cautious about > redistributing changes. Does anyone know better? qpeek is part of PBS tools[1], which is under GPL v2. The version in the TORQUE contrib area must be ancient, because the current svn head version has the following in its header: # Copyright 2006, 2007 Ohio Supercomputer Center # # License: GNU GPL v2; see ../COPYING for details. [1] http://www.nics.tennessee.edu/~troy/pbstools/ > 2) If I improve that qpeek, who might I contribute the changes back > to? Please send those to me, as I'm retaining PBS tools. 
--Troy -- Troy Baer, Senior HPC System Administrator National Institute for Computational Sciences, University of Tennessee http://www.nics.tennessee.edu/ Phone: 865-241-4233 From mej at lbl.gov Tue Jul 31 12:52:15 2012 From: mej at lbl.gov (Michael Jennings) Date: Tue, 31 Jul 2012 11:52:15 -0700 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: References: Message-ID: <20120731185214.GC5670@lbl.gov> On Monday, 30 July 2012, at 17:26:10 (-0600), David Beer wrote: > Our support team has asked us repeatedly about making --with-debug > the default for TORQUE, making it so it can be disabled using > --without-debug, but obviously being on by default. Our support team > wants this so that when a site reports a core dump, the core has > information in it and the admin doesn't have to recompile and try to > reproduce in order to get enough information to debug the > crash. What are your thoughts on making this change? There are a few considerations here. 1 - Currently --with-debug eliminates all optimizations in $CFLAGS to ensure that debugging symbols are accurate. Some sites may object to the potential performance penalty. Many projects use -O2 -g (or -O2 -g3) with reasonable success. You may want to consider splitting up the adding of -g3 (--with-debug) and the removal of optimization (perhaps --without-optimization?). 2 - RPM builds already (by default on most systems) generate separate "debuginfo" packages which contain the debugging symbols for the main package(s). As mentioned above, while some symbols may be marked as "optimized out," much of the debugging is still valid in more recent gcc/gdb versions. 3 - I submitted a patch to Ken for a "--with debug" feature for the spec file which would activate --with-debug AND allow the symbols to remain inside the primary RPM packages instead of being moved to the torque-debuginfo package. It doesn't seem to have appeared in SVN yet, so you may want to take another gander. 
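A minimal sketch of the split suggested in point 1 above, treating the choice of debugging symbols and the choice of optimization as independent of each other. The function name `compose_cflags` and its yes/no arguments are hypothetical and are not TORQUE configure options; the compiler flags themselves (-O0, -O2, -g0, -g3) are standard gcc flags.

```shell
# Hypothetical helper: build a CFLAGS string from two independent choices.
#   $1: "yes" to include debugging symbols (-g3), anything else for -g0
#   $2: "yes" to keep optimization (-O2), anything else for -O0
compose_cflags() {
    if [ "$1" = "yes" ]; then symbols="-g3"; else symbols="-g0"; fi
    if [ "$2" = "yes" ]; then optimize="-O2"; else optimize="-O0"; fi
    printf '%s %s\n' "$optimize" "$symbols"
}

# Symbols kept, optimization kept: the "-O2 -g3" combination mentioned above.
compose_cflags yes yes
# Symbols kept, optimization removed: what --with-debug is described as doing today.
compose_cflags yes no
```

The resulting string could then be handed to a build, for example as `CFLAGS="$(compose_cflags yes yes)" ./configure`. Whether TORQUE's configure script keeps a user-supplied CFLAGS intact is not stated in this thread, so treat that invocation as an assumption.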
That said, I'm all for making -g3 a default flag. Just not sure about the rest. ;-) Also, you may want to consider renaming --with-debug to --with-symbols or --with-gdb-symbols instead. Having both --enable-debug and --with-debug (and having them do entirely disparate things) is rather confusing for new users. Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From dbeer at adaptivecomputing.com Tue Jul 31 13:56:53 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 31 Jul 2012 13:56:53 -0600 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: <20120731185214.GC5670@lbl.gov> References: <20120731185214.GC5670@lbl.gov> Message-ID: Michael, On Tue, Jul 31, 2012 at 12:52 PM, Michael Jennings wrote: > On Monday, 30 July 2012, at 17:26:10 (-0600), > David Beer wrote: > > > Our support team has asked us repeatedly about making --with-debug > > the default for TORQUE, making it so it can be disabled using > > --without-debug, but obviously being on by default. Our support team > > wants this so that when a site reports a core dump, the core has > > information in it and the admin doesn't have to recompile and try to > > reproduce in order to get enough information to debug the > > crash. What are your thoughts on making this change? > > There are a few considerations here. > > 1 - Currently --with-debug eliminates all optimizations in $CFLAGS to > ensure that debugging symbols are accurate. Some sites may object > to the potential performance penalty. Many projects use -O2 -g > (or -O2 -g3) with reasonable success. You may want to consider > splitting up the adding of -g3 (--with-debug) and the removal of > optimization (perhaps --without-optimization?). > Would you still think it's worth it to separate these if I told you that the default would be without optimization as well?
The benefits of having debugging symbols are compromised if optimizations are on, as you point out. > 2 - RPM builds already (by default on most systems) generate separate > "debuginfo" packages which contain the debugging symbols for the > main package(s). As mentioned above, while some symbols may be > marked as "optimized out," much of the debugging is still valid in > more recent gcc/gdb versions. > 3 - I submitted a patch to Ken for a "--with debug" feature for the > spec file which would activate --with-debug AND allow the symbols > to remain inside the primary RPM packages instead of being moved > to the torque-debuginfo package. It doesn't seem to have appeared > in SVN yet, so you may want to take another gander. > > We will have to look into this patch. I don't believe I've seen it yet. > That said, I'm all for making -g3 a default flag. Just not sure about > the rest. ;-) > > Also, you may want to consider renaming --with-debug to --with-symbols > or --with-gdb-symbols instead. Having both --enable-debug and > --with-debug (and having them do entirely disparate things) is rather > confusing for new users. > > Does anyone else have an opinion on this? I'm interested in making things unambiguous. David > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing
From mej at lbl.gov Tue Jul 31 14:17:15 2012 From: mej at lbl.gov (Michael Jennings) Date: Tue, 31 Jul 2012 13:17:15 -0700 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: References: <20120731185214.GC5670@lbl.gov> Message-ID: <20120731201714.GH5670@lbl.gov> On Tuesday, 31 July 2012, at 13:56:53 (-0600), David Beer wrote: > Would you still think its worth it to separate these if I told you > that the default would be without optimization as well? The benefits > of having debugging symbols are compromised if optimizations are on, > as you point out. Compromised, but not eliminated entirely. -O0 -g3 is the most useful, but -O2 -g3 is more useful than -O2. :-) I'm still in favor of them being split up regardless. That way, a particular site can make the choice between larger binaries with symbols (-g3) vs. smaller binaries with addresses only (-g0) entirely independent of the decision between higher (-O1/2/3/s) and lower (-O0) performance. As you know, adding debugging symbols has no effect on performance, but disabling optimization almost certainly would. :-) For our site, we haven't noticed a performance penalty from the -O0 -g3 packages I've built over the last couple weeks, but that doesn't mean other sites won't see one. Have you all done any performance testing between the various optimization levels? That might give you some valuable insights into the implications of what you're proposing....
:-) Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From Gareth.Williams at csiro.au Tue Jul 31 15:13:38 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Wed, 1 Aug 2012 07:13:38 +1000 Subject: [torqueusers] qpeek in contrib In-Reply-To: <5018264F.20500@noaa.gov> References: <007DECE986B47F4EABF823C1FBB19C62010529679FFE@exvic-mbx04.nexus.csiro.au> <5018264F.20500@noaa.gov> Message-ID: <007DECE986B47F4EABF823C1FBB19C62010529679FFF@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Craig Tierney [mailto:craig.tierney at noaa.gov] > Sent: Wednesday, 1 August 2012 4:39 AM > To: torqueusers at supercluster.org > Subject: Re: [torqueusers] qpeek in contrib > > On 7/31/12 4:35 AM, Gareth.Williams at csiro.au wrote: > > Hi, > > > > We've found a need for a command to look at jobs output before it is > > finished so tried the qpeek script from OSC in the torque contrib > directory. With a few minor mods it works fine but I want to check a > few things (different people may be needed to answer each question): > > > > 1) That qpeek is marked: Copyright 2001, Ohio Supercomputer Center > but > > with no author or licensing info. Since it is clearly in a > > contrib(uted) section of an open source project I'm assuming that > modifying it for local use is fair game - though I'd be cautious about > redistributing changes. Does anyone know better? > > > > 2) If I improve that qpeek, who might I contribute the changes back > to? > > > > 3) Does anyone have an alternative to share? I see nf.nci.org.au has > > a qcat which serves a similar purpose with a mixture of more and less > functionality (it will also show the script but does not head, tail or > follow). > > > > Frankly I'd prefer not to write my own! > > > > Gareth, > > Do you have a shared filesystem between your systems? 
> If so, why not enable "$spool_as_final_name true" in mom_priv/config. > > http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/a.cmomconfig.php > > Craig Hi Craig, Thanks but that is pretty much the situation we are trying to avoid! We are finding that many small writes to a shared filesystem cause a performance impact. We figure users are writing to a shared filesystem because they want feedback so if we can give them a qpeek, they can be happy to spool the small writes to a local filesystem instead. Gareth From Gareth.Williams at csiro.au Tue Jul 31 17:25:04 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Wed, 1 Aug 2012 09:25:04 +1000 Subject: [torqueusers] qpeek in contrib In-Reply-To: <1343760523.20484.421.camel@browncoat.jics.utk.edu> References: <007DECE986B47F4EABF823C1FBB19C62010529679FFE@exvic-mbx04.nexus.csiro.au> <1343760523.20484.421.camel@browncoat.jics.utk.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C6201052967A001@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Troy Baer [mailto:tbaer at utk.edu] > Sent: Wednesday, 1 August 2012 4:49 AM > To: Torque Users Mailing List > Subject: Re: [torqueusers] qpeek in contrib > > On Tue, 2012-07-31 at 20:35 +1000, Gareth.Williams at csiro.au wrote: > > We've found a need for a command to look at jobs output before it is > > finished so tried the qpeek script from OSC in the torque contrib > > directory. With a few minor mods it works fine but I want to check a > > few things (different people may be needed to answer each question): > > > > 1) That qpeek is marked: Copyright 2001, Ohio Supercomputer Center but > > with no author or licensing info. Since it is clearly in a > > contrib(uted) section of an open source project I'm assuming that > > modifying it for local use is fair game - though I'd be cautious about > > redistributing changes. Does anyone know better? > > qpeek is part of PBS tools[1], which is under GPL v2.
The version in > the TORQUE contrib area must be ancient, because the current svn head > version has the following in its header: > > # Copyright 2006, 2007 Ohio Supercomputer Center # # License: GNU GPL > v2; see ../COPYING for details. > > [1] http://www.nics.tennessee.edu/~troy/pbstools/ > > > 2) If I improve that qpeek, who might I contribute the changes back > > to? > > Please send those to me, as I'm retaining PBS tools. > > --Troy > -- > Troy Baer, Senior HPC System Administrator National Institute for > Computational Sciences, University of Tennessee > http://www.nics.tennessee.edu/ > Phone: 865-241-4233 Thanks Troy. I've checked out pbstools and applied my modifications to that qpeek. I'll send you patches directly. For reference, patches are needed for: - array job support - use ssh rather than rsh - peeking at the job script (might not be completely portable) - apparently "chop to magic pbs length" is no longer needed - numa-enabled torque with virtual node naming (this change might be tricky to make portable) Gareth From samuel at unimelb.edu.au Tue Jul 31 19:26:26 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 01 Aug 2012 11:26:26 +1000 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: <20120731201714.GH5670@lbl.gov> References: <20120731185214.GC5670@lbl.gov> <20120731201714.GH5670@lbl.gov> Message-ID: <501885C2.1070509@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 01/08/12 06:17, Michael Jennings wrote: > Have you all done any performance testing between the various > optimization levels? That might give you some valuable insights > into the implications of what you're proposing.... :-) I think that's a very valid question, I would have thought Torque would spend a lot of time in I/O (either filesystem or sockets) so perhaps optimisation may not buy that much. 
cheers, Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAYhcIACgkQO2KABBYQAh89zgCdGQpMmkOY4Uu/vwCL741UPpCP K68Anj326OxgikSMDXWgehBmQIvueNBk =8ODG -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Tue Jul 31 19:26:57 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 01 Aug 2012 11:26:57 +1000 Subject: [torqueusers] Changing --with-debug to the default In-Reply-To: References: <50177C58.5060104@unimelb.edu.au> Message-ID: <501885E1.6070404@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 31/07/12 23:55, Ken Nielson wrote: > --enable-debug will activate the DEBUG macros. --with-debug simply > compiles TORQUE using the -g option in CFLAGS. It does not create > any output. Oh! I'd assumed they'd do the same thing.. Interesting! cheers, Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlAYheAACgkQO2KABBYQAh/L8gCcDEc4s9HtjAs6UZzLw7yD4uJ8 KH4AoJFItymkEao8dzzlgfNpfFfZcx+y =Hlwn -----END PGP SIGNATURE-----
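As a rough illustration of the distinction Ken Nielson draws in the message quoted above, the sketch below treats the two options as independent switches and prints the configure command line that would result, without running it. The function name `configure_args` and its yes/no arguments are hypothetical. Only the two option names and their described effects (`--with-debug` adds -g to CFLAGS, `--enable-debug` activates the DEBUG macros) come from the thread.

```shell
# Hypothetical dry-run helper: print a configure command line, do not run it.
#   $1: "yes" to request debugging symbols   (--with-debug, -g in CFLAGS)
#   $2: "yes" to request the DEBUG macros    (--enable-debug)
configure_args() {
    args=""
    if [ "$1" = "yes" ]; then args="$args --with-debug"; fi
    if [ "$2" = "yes" ]; then args="$args --enable-debug"; fi
    printf './configure%s\n' "$args"
}

# Symbols only, as proposed for the new default.
configure_args yes no
# Symbols plus DEBUG macros.
configure_args yes yes
```

Printing the command instead of executing it keeps the two choices visible side by side, which is the point of confusion raised in the thread.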