From knielson at adaptivecomputing.com Thu Sep 1 10:22:10 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 01 Sep 2011 10:22:10 -0600 (MDT)
Subject: [torqueusers] Autoconf and TORQUE
In-Reply-To: <677cf157-4407-4288-b68a-937654063ce1@mail>
Message-ID: <63b736cf-a0bd-4a9d-a554-94f0bf7cde80@mail>

Hi all,

The system we use to build TORQUE uses autoconf version 2.59 and automake 1.9.2. The versions of autoconf and automake do not matter to TORQUE 3.0.x, but they do make a difference with 2.4.x and 2.5.x because we deliver the Makefile.in files with the build.

My question is: does anyone depend on TORQUE building with autoconf 2.59 or automake 1.9.2? Can we upgrade to newer versions?

Regards

Ken

From roy.dragseth at cc.uit.no Fri Sep 2 01:05:04 2011
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Fri, 2 Sep 2011 09:05:04 +0200
Subject: [torqueusers] qsub patch for job script arguments.
In-Reply-To: <4E5C8D17.4000803@sara.nl>
References: <201108272246.53285.roy.dragseth@cc.uit.no> <20110829224654.GC2998@lbl.gov> <4E5C8D17.4000803@sara.nl>
Message-ID: <201109020905.04317.roy.dragseth@cc.uit.no>

On Tuesday, August 30, 2011 09:11:19 Bas van der Vlies wrote:
> On 30-08-11 00:46, Michael Jennings wrote:
> > On Tuesday, 30 August 2011, at 08:43:09 (+1000),
> >
> > Gareth.Williams at csiro.au wrote:
> >> I like the -- idea, but I think it is more natural to have (or at least
> >> allow) -- before the job script: qsub -- run.sh arg1 arg2
> >> Before is better to signify that qsub options are complete and the rest
> >> of the line is the job script and its options. After still signifies
> >> this but splits the job script and its options unnecessarily.
> >
> > That doesn't work. The job script is an argument to qsub. The
> > convention is that -- halts option parsing; the command is supposed to
> > ignore anything that comes after --. If qsub ignores the jobscript
> > name, not much will get done.
> > ;-)
> >
> > Using -- is by far the simplest and most standard approach IMHO.
> > Either that, or just ignore everything after the jobscript name.
>
> I am also for -- solution. As said it is a standard approach and seen a lot
> in other software.

I would strongly argue for the -- approach as an addition to the -F flag. If there are any technical issues with the patch please let me know.

r.

From guilherme.consultor at gmail.com Fri Sep 2 03:17:45 2011
From: guilherme.consultor at gmail.com (Guilherme Rocha)
Date: Fri, 2 Sep 2011 06:17:45 -0300
Subject: [torqueusers] Success setting up a new Torque Environment in University
Message-ID:

Hello folks,

My name is Guilherme and this is my first post here.
Thanks for this great project.

We're setting up a Torque cluster with 23 nodes that will be used for bioinformatics tasks.

I'm a complete newbie to all of this, so after hard steps of troubleshooting, we finally received good news in the logs. We did a small alignment using clustalw.

But we have some doubts about how to use a program in parallel, like:

Question 1) Do I need to have clustalw (or the script programs) installed on all nodes?
Clustalw is only installed on the head node for now.

Question 2) Can we use/open GUI programs' interfaces to work using Torque?

Question 3) When I submit a job, even requesting 10 nodes, clustalw is being run on only one node. What can be wrong?

Thanks in advance,

--
Guilherme Rocha
GF7 Doc & Systems - Soluções Tecnológicas
Pesquisa e Desenvolvimento - World Wide
R. João Goulart, 170 - Rio Pardo - RS - CEP 96640-000
Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br
From knielson at adaptivecomputing.com Fri Sep 2 06:45:02 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 02 Sep 2011 06:45:02 -0600 (MDT)
Subject: [torqueusers] Success setting up a new Torque Environment in University
In-Reply-To:
Message-ID:

----- Original Message -----
> From: "Guilherme Rocha"
> To: torqueusers at supercluster.org
> Sent: Friday, September 2, 2011 3:17:45 AM
> Subject: [torqueusers] Success setting up a new Torque Environment in University
>
>
> Hello folks,
>
>
> My name is Guilherme and this is my first post here.
> Thanks for this great project.
>
> We're setting up a Torque cluster with 23 nodes that will be used for
> bioinformatics tasks.
>
> I'm a complete newbie to all of this, so after hard steps of
> troubleshooting, we finally
> received good news in the logs. We did a small alignment using clustalw.
>
> But we have some doubts about how to use a program in parallel, like:
>
>
> Question 1) Do I need to have clustalw (or the script programs) installed
> on all nodes?
> Clustalw is only installed on the head node for now.
>
> Question 2) Can we use/open GUI programs' interfaces to work using
> Torque?
>
> Question 3) When I submit a job, even requesting 10 nodes, clustalw
> is being run
> on only one node. What can be wrong?

When TORQUE launches a multi-node (parallel) job, only one of the execution nodes receives the request from pbs_server. That node is called the mother superior. The mother superior then contacts the other sister nodes for the job and requests that the sisters join the job. The script submitted on the qsub command line is then executed on the mother superior. In order to have the job also execute on the sister nodes you need something like MPI, or you can write your own program to use the task management API in TORQUE.
There is also a utility in TORQUE called pbsdsh that can be used in the job script to make the requested program run on all of the nodes requested in the job.

Regards

Ken Nielson
Adaptive Computing

>
>
> thanks in advance,
>
> --
> Guilherme Rocha
> GF7 Doc & Systems - Soluções Tecnológicas
> Pesquisa e Desenvolvimento - World Wide
> R. João Goulart, 170 - Rio Pardo - RS - CEP 96640-000
> Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From gus at ldeo.columbia.edu Fri Sep 2 07:46:51 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 02 Sep 2011 09:46:51 -0400
Subject: [torqueusers] Success setting up a new Torque Environment in University
In-Reply-To:
References:
Message-ID: <4E60DE4B.3090300@ldeo.columbia.edu>

Hi Guilherme

To add to Ken's suggestions.

1) Make sure that, on the [head] node where you run pbs_server,
your ${TORQUE}/server_priv/nodes file correctly describes
your cluster configuration.
Something like this [minimally, there are more options]:

node01 np=8
node02 np=8
...
node23 np=8

(Assuming the nodes' hostnames are node01, etc, and they have 8 cpu cores each.
Adjust to your reality.)

If not, edit that file, and restart your pbs_server.

2) To test, as Ken said, use pbsdsh.
See the pbsdsh man pages.
Check also the qsub man page.
Although you can pass the job parameters to qsub via command line,
you may want to write a little script:

#PBS -q batch
#PBS -l nodes=4:ppn=8
#PBS -N my_job
...
mpiexec -np 32 ./my_mpi_program
[or the pbsdsh command]

and do 'qsub my_pbs_script'

3) I am not familiar with clustalw, which seems to be a genome sequencing program.
Is it parallel?
Does it use MPI?
Does it require a specific flavor of MPI?
If it works with any MPI, you could install MPICH2 (if your network is
Gigabit Ethernet), MVAPICH2 (if your network is Infiniband),
or OpenMPI (for Gigabit Ethernet, or Infiniband, or Myrinet).
All of them can be built with the GNU compilers (gcc, g++, gfortran),
or with commercial compilers (Intel, PGI, etc).
See:
http://www.open-mpi.org/
http://www.mcs.anl.gov/research/projects/mpich2/
http://mvapich.cse.ohio-state.edu/overview/mvapich2/

I am fond of OpenMPI because it is very flexible and has excellent
integration with Torque.

Then you need to build clustalw, linking it to your MPI.
Typically this is easier to do using the MPI compiler wrappers
(mpicc, mpicxx, mpif77, mpif90).

I hope this helps,
Gus Correa

Ken Nielson wrote:
> ----- Original Message -----
>> From: "Guilherme Rocha"
>> To: torqueusers at supercluster.org
>> Sent: Friday, September 2, 2011 3:17:45 AM
>> Subject: [torqueusers] Success setting up a new Torque Environment in University
>>
>>
>> Hello folks,
>>
>>
>> My name is Guilherme and this is my first post here.
>> Thanks for this great project.
>>
>> We're setting up a Torque cluster with 23 nodes that will be used for
>> bioinformatics tasks.
>>
>> I'm a complete newbie to all of this, so after hard steps of
>> troubleshooting, we finally
>> received good news in the logs. We did a small alignment using clustalw.
>>
>> But we have some doubts about how to use a program in parallel, like:
>>
>>
>> Question 1) Do I need to have clustalw (or the script programs) installed
>> on all nodes?
>> Clustalw is only installed on the head node for now.
>>
>> Question 2) Can we use/open GUI programs' interfaces to work using
>> Torque?
>>
>> Question 3) When I submit a job, even requesting 10 nodes, clustalw
>> is being run
>> on only one node. What can be wrong?
> When TORQUE launches a multi-node (parallel) job, only one of the execution nodes receives the request from pbs_server. That node is called the mother superior.
> The mother superior then contacts the other sister nodes for the job and requests that the sisters join the job. The script submitted on the qsub command line is then executed on the mother superior. In order to have the job also execute on the sister nodes you need something like MPI, or you can write your own program to use the task management API in TORQUE. There is also a utility in TORQUE called pbsdsh that can be used in the job script to make the requested program run on all of the nodes requested in the job.
>
> Regards
>
> Ken Nielson
> Adaptive Computing
>>
>> thanks in advance,
>>
>> --
>> Guilherme Rocha
>> GF7 Doc & Systems - Soluções Tecnológicas
>> Pesquisa e Desenvolvimento - World Wide
>> R. João Goulart, 170 - Rio Pardo - RS - CEP 96640-000
>> Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br
>

From knielson at adaptivecomputing.com Fri Sep 2 10:53:55 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 02 Sep 2011 10:53:55 -0600 (MDT)
Subject: [torqueusers] TORQUE 2.5.8 released
In-Reply-To: <55d70021-d8fa-4625-b175-2618f4cadb39@mail>
Message-ID:

TORQUE 2.5.8 is now available for general release.

There were no new features added in this release but there were some notable bug fixes. Among them was a fix for the queue resource procct. This resource is intended to be internal to TORQUE. The problem occurred when a job could not be routed to an execution queue and had to be put in a routing queue. The procct count would be interpreted by Moab and Maui as a generic resource. Since the procct generic resource did not exist, the job would be stuck. This has been fixed.

Another fix was for NVIDIA GPU mode setting. If exclusive_process or default was requested as a GPU mode and TORQUE was using a scheduler, the mode would not be changed. This was due to the scheduler stripping off the mode designation. This problem has now been fixed.
However, using the default mode will require adding a property named "default" to the nodes file until a new release of Moab is available. The version is not known at this time.

Please see the CHANGELOG for all bugs fixed in this version of TORQUE.

The release can be downloaded at http://www.clusterresources.com/downloads/torque/torque-2.5.8.tar.gz

Thanks to everyone who has made this build possible.

Ken Nielson
Adaptive Computing

From sm4082 at nyu.edu Fri Sep 2 12:04:13 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 2 Sep 2011 14:04:13 -0400
Subject: [torqueusers] Regarding Job Arrays
Message-ID:

Hello,

I recently upgraded Torque on our clusters. Everything seems to be working fine except the job arrays feature. I want to know whether it is related to the --enable-array configure option. I haven't installed Torque with this option turned on. Do I need this option to enable multiple job submission using the -t flag with qsub?

When I submit a test job with something like

qsub -t 1 pbs.script

my job script contains

#!/bin/bash
#PBS -S /bin/bash
#PBS -V
#PBS -l nodes=1:ppn=1,walltime=1:00:00
#PBS -N ivan
cd /home/xxx
echo "${PBS_ARRAYID}"
exit 0;

When I submit the job it seems it's recognizing the -t flag as I get the jobid as
5415713[].hpc0.local

My error file contains

set: No match.
set: No match.

Output file contains nothing.

I would really appreciate it if someone could provide some insight into this.

Thank you in advance.
Sreedhar.

From jjc at iastate.edu Fri Sep 2 13:49:29 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 2 Sep 2011 14:49:29 -0500
Subject: [torqueusers] Success setting up a new Torque Environment in University
In-Reply-To:
References:
Message-ID:

You'll need an MPI application to use multiple nodes. Perhaps this application would be Clustalw-MPI.
It looks like this is available at:
http://www.bii.a-star.edu.sg/achievements/applications/clustalw/download.php

Information on this can be found at:
http://www.bii.a-star.edu.sg/docs/software/README.clustalw-mpi

Torque just reserves nodes; you need something like MPI and a program written in MPI to use multiple nodes.

If you need a suggestion on MPI, I use OpenMPI because it installs easily and can use multiple network interconnects; it also works well with Torque.

Ethernet works, but is slow; however, that is likely what you have. We use Infiniband and Myrinet networks in addition to Ethernet. They give much better performance for our workloads, but the cards and switches are very expensive.

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Guilherme Rocha
Sent: Friday, September 02, 2011 4:18 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] Success setting up a new Torque Environment in University

Hello folks,

My name is Guilherme and this is my first post here.
Thanks for this great project.

We're setting up a Torque cluster with 23 nodes that will be used for bioinformatics tasks.

I'm a complete newbie to all of this, so after hard steps of troubleshooting, we finally received good news in the logs. We did a small alignment using clustalw.

But we have some doubts about how to use a program in parallel, like:

Question 1) Do I need to have clustalw (or the script programs) installed on all nodes?
Clustalw is only installed on the head node for now.

Question 2) Can we use/open GUI programs' interfaces to work using Torque?

Question 3) When I submit a job, even requesting 10 nodes, clustalw is being run on only one node. What can be wrong?

Thanks in advance,

--
Guilherme Rocha
GF7 Doc & Systems - Soluções Tecnológicas
Pesquisa e Desenvolvimento - World Wide
R. João Goulart, 170 - Rio Pardo - RS - CEP 96640-000
Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br

From sm4082 at nyu.edu Fri Sep 2 13:54:53 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 2 Sep 2011 15:54:53 -0400
Subject: [torqueusers] Regarding Job Arrays
In-Reply-To:
References:
Message-ID: <3D484E83-6FC3-402A-9CD1-47329992AB2D@nyu.edu>

Hi,

The same job runs OK on a different cluster which was upgraded at the same time. Not sure what could be the problem.

Does anyone have any idea?

Thanks,
Sreedhar.

On Sep 2, 2011, at 2:04 PM, Sreedhar Manchu wrote:

> Hello,
>
> I recently upgraded Torque on our clusters. Everything seems to be working fine except the job arrays feature. I want to know whether it is related to the --enable-array configure option. I haven't installed Torque with this option turned on. Do I need this option to enable multiple job submission using the -t flag with qsub?
>
> When I submit a test job with something like
>
> qsub -t 1 pbs.script
>
> my job script contains
>
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -V
> #PBS -l nodes=1:ppn=1,walltime=1:00:00
> #PBS -N ivan
> cd /home/xxx
> echo "${PBS_ARRAYID}"
> exit 0;
>
> When I submit the job it seems it's recognizing the -t flag as I get the jobid as
> 5415713[].hpc0.local
>
> My error file contains
>
> set: No match.
> set: No match.
>
> Output file contains nothing.
>
> I would really appreciate it if someone could provide some insight into this.
>
> Thank you in advance.
> Sreedhar.

From sm4082 at nyu.edu Fri Sep 2 20:43:20 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 2 Sep 2011 22:43:20 -0400
Subject: [torqueusers] Regarding Job Arrays
In-Reply-To: <3D484E83-6FC3-402A-9CD1-47329992AB2D@nyu.edu>
References: <3D484E83-6FC3-402A-9CD1-47329992AB2D@nyu.edu>
Message-ID: <5DEB749F-7EF4-425E-8485-BBBE07217F6C@nyu.edu>

Hi,

This error was from the prologue and epilogue scripts on our systems. This kind of error comes from csh scripts, which explains it, as our prologue and epilogue scripts are csh scripts.

Sreedhar.

On Sep 2, 2011, at 3:54 PM, Sreedhar Manchu wrote:

> Hi,
>
> The same job runs OK on a different cluster which was upgraded at the same time. Not sure what could be the problem.
>
> Does anyone have any idea?
>
> Thanks,
> Sreedhar.
>
> On Sep 2, 2011, at 2:04 PM, Sreedhar Manchu wrote:
>
>> Hello,
>>
>> I recently upgraded Torque on our clusters. Everything seems to be working fine except the job arrays feature. I want to know whether it is related to the --enable-array configure option. I haven't installed Torque with this option turned on. Do I need this option to enable multiple job submission using the -t flag with qsub?
>>
>> When I submit a test job with something like
>>
>> qsub -t 1 pbs.script
>>
>> my job script contains
>>
>> #!/bin/bash
>> #PBS -S /bin/bash
>> #PBS -V
>> #PBS -l nodes=1:ppn=1,walltime=1:00:00
>> #PBS -N ivan
>> cd /home/xxx
>> echo "${PBS_ARRAYID}"
>> exit 0;
>>
>> When I submit the job it seems it's recognizing the -t flag as I get the jobid as
>> 5415713[].hpc0.local
>>
>> My error file contains
>>
>> set: No match.
>> set: No match.
>>
>> Output file contains nothing.
>>
>> I would really appreciate it if someone could provide some insight into this.
>>
>> Thank you in advance.
>> Sreedhar.

From Gareth.Williams at csiro.au Fri Sep 2 21:28:13 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Sat, 3 Sep 2011 13:28:13 +1000
Subject: [torqueusers] qsub patch for job script arguments.
In-Reply-To: <20110829224654.GC2998@lbl.gov>
References: <201108272246.53285.roy.dragseth@cc.uit.no> <67423d1e-2de2-4be3-b323-e7cae64e8464@mail> <007DECE986B47F4EABF823C1FBB19C620102AC2371C3@exvic-mbx04.nexus.csiro.au> <20110829224654.GC2998@lbl.gov>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102B239A3D5@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Michael Jennings [mailto:mej at lbl.gov]
> Sent: Tuesday, 30 August 2011 8:47 AM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] qsub patch for job script arguments.
>
> On Tuesday, 30 August 2011, at 08:43:09 (+1000),
> Gareth.Williams at csiro.au wrote:
>
> > I like the -- idea, but I think it is more natural to have (or at
> least allow) -- before the job script:
> > qsub -- run.sh arg1 arg2
> > Before is better to signify that qsub options are complete and the
> rest of the line is the job script and its options. After still
> signifies this but splits the job script and its options unnecessarily.
>
> That doesn't work. The job script is an argument to qsub. The
> convention is that -- halts option parsing; the command is supposed to
> ignore anything that comes after --. If qsub ignores the jobscript
> name, not much will get done. ;-)
>
> Using -- is by far the simplest and most standard approach IMHO.
> Either that, or just ignore everything after the jobscript name.
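
As a rough illustration of the convention discussed in the quoted text above, here is a minimal Python sketch using the standard `getopt` module. The option string `q:` and the sample argument lists are assumptions picked for the example; this is not qsub's actual parser. It shows that `--` ends option parsing and leaves everything after it untouched, and that a POSIX-style parser also stops at the first non-option argument (the script name) even without `--`.

```python
import getopt


def split_args(argv):
    """Split argv into (parsed options, remaining arguments).

    getopt.getopt stops at the first non-option argument and treats
    '--' as an explicit end-of-options marker (the marker is consumed).
    The option string "q:" is an assumption for this example only.
    """
    return getopt.getopt(argv, "q:")


# '--' before the script: everything after it is left for the script.
opts, rest = split_args(["-q", "batch", "--", "run.sh", "-x", "arg1"])
print(opts, rest)

# No '--': parsing still stops at the first non-option ('run.sh'),
# so '-x' is not interpreted as an option of the outer command.
opts2, rest2 = split_args(["-q", "batch", "run.sh", "-x", "arg1"])
print(opts2, rest2)
```

In both calls the script name and its arguments come back in the remainder list, which is the behaviour the `--`-before-the-script reading relies on.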

Hi Michael,

I've always read information about -- usage as halting 'option' parsing, as you say, but I guess I assumed there is usually a difference between 'options' (and their associated required/optional arguments) and other arguments. E.g., in the man page for (GNU) time:

time [options] command [arguments...]
-snip-
GNU Standard Options
-snip-
-- Terminate option list.

i.e. '--' is the last 'option', so that 'command' and [arguments...] get used without being further interpreted by the argument parsing.

I think in this case,
qsub [options] [script [script_args...]]
is more intuitive than
qsub [options] [script [-- script_args...]]

Indeed -- need not be necessary unless qsub can't recognise which option is the script and aggressively swallows up args after the script.
(compare: 'ls / -l', 'ls -- / -l' and 'ls -l /')

Gareth

>
> --
> Michael Jennings
> Linux Systems and Cluster Engineer
> High-Performance Computing Services
> Bldg 50B-3209E W: 510-495-2687
> MS 050C-3396 F: 510-486-8615

From samuel at unimelb.edu.au Sun Sep 4 22:37:18 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 05 Sep 2011 14:37:18 +1000
Subject: [torqueusers] qsub patch for job script arguments.
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102B239A3D5@exvic-mbx04.nexus.csiro.au>
References: <201108272246.53285.roy.dragseth@cc.uit.no> <67423d1e-2de2-4be3-b323-e7cae64e8464@mail> <007DECE986B47F4EABF823C1FBB19C620102AC2371C3@exvic-mbx04.nexus.csiro.au> <20110829224654.GC2998@lbl.gov> <007DECE986B47F4EABF823C1FBB19C620102B239A3D5@exvic-mbx04.nexus.csiro.au>
Message-ID: <4E6451FE.4020506@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03/09/11 13:28, Gareth.Williams at csiro.au wrote:

> I've always read information about -- usage as halting
> 'option' parsing, as you say, but I guess I assumed
> there is usually a difference between 'options' (and
> their associated required/optional arguments) and
> other arguments. Eg.
> In the man page for (gnu) time:

That's a good spot - it would be good to see if there is
more precedent for this usage versus that implemented by
Roy's patch. It's generally good to reuse behaviour that's
already in use elsewhere.

cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk5kUf4ACgkQO2KABBYQAh/opACggEA6MmtyUvuAoKkGSypZpqs8
yFUAn1yOP+1baR2eJOUyl6DOqEgove+6
=GDny
-----END PGP SIGNATURE-----

From roy.dragseth at cc.uit.no Mon Sep 5 06:40:06 2011
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Mon, 5 Sep 2011 14:40:06 +0200
Subject: [torqueusers] qsub patch for job script arguments.
In-Reply-To: <4E6451FE.4020506@unimelb.edu.au>
References: <201108272246.53285.roy.dragseth@cc.uit.no> <007DECE986B47F4EABF823C1FBB19C620102B239A3D5@exvic-mbx04.nexus.csiro.au> <4E6451FE.4020506@unimelb.edu.au>
Message-ID: <201109051440.07006.roy.dragseth@cc.uit.no>

On Monday, September 05, 2011 06:37:18 Christopher Samuel wrote:
> On 03/09/11 13:28, Gareth.Williams at csiro.au wrote:
> > I've always read information about -- usage as halting
> > 'option' parsing, as you say, but I guess I assumed
> > there is usually a difference between 'options' (and
> > their associated required/optional arguments) and
> > other arguments. Eg. In the man page for (gnu) time:
>
> That's a good spot - it would be good to see if there is
> more precedent for this usage versus that implemented by
> Roy's patch. It's generally good to reuse behaviour that's
> already in use elsewhere.

I would also happily go for a solution like

qsub [options] runscript.sh [runscript options]

but wouldn't that be a change in the default behaviour?
I have the impression that AC is very reluctant to accept anything that changes the default. My suggested patch is an attempt to keep the default behaviour of not accepting any runscript options. qsub will then only allow them if you explicitly state you want them by using '--'.

r.

From soubari at yahoo.com Tue Sep 6 08:07:45 2011
From: soubari at yahoo.com (sam oubari)
Date: Tue, 6 Sep 2011 07:07:45 -0700 (PDT)
Subject: [torqueusers] Help! One Puzzle At a Time...
In-Reply-To: <1314572743.54682.YahooMailNeo@web110608.mail.gq1.yahoo.com>
References: <1314572743.54682.YahooMailNeo@web110608.mail.gq1.yahoo.com>
Message-ID: <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com>

Hello,

I am no expert at TORQUE and one key puzzle for us is why, on occasion, a waiting job moves from H to Q but not R when its scheduled time comes. When I attempt to force it with qrun I get:

qrun: Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available 3030.naboo.linnbenton.edu

Below is the output of 'printserverdb' and 'qnodes' during the "freeze". To fix, I had to kill mom, restart it, then qrun the first Q job.

Any hints would be greatly appreciated. Thx! Sam.

PS. I've provided more details on 8/28/11.

------
Sam Oubari, Manager of Systems & Application Programming
Linn-Benton Community College -- Information Services
6500 Pacific Blvd SW, Room# CC 110E -- Albany OR 97321
Tel: 541-917-4355/Fax: 541-917-4379

======

# printserverdb
---------------------------------------------------
numjobs:        26
numque:         5
jobidnumber:    3575
savetm:         1314100391
--attributes--
scheduling = True
max_running = 23
total_jobs = 22
state_count = Transit:0 Queued:0 Held:2 Waiting:17 Running:0 Exiting:0
default_queue = sys_tst
log_events = 511
mail_from = adm
query_other_jobs = False
resources_assigned.nodect = 0
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = False
pbs_version = 2.5.6
keep_completed = 600
allow_node_submit = True
next_job_number = 1
net_counter = 7 1 0

# qnodes
naboo
     state = down
     np = 40
     ntype = cluster
     status = rectime=1315288785,varattr=,jobs=3448.naboo.linnbenton.edu 3449.naboo.linnbenton.edu 3450.naboo.linnbenton.edu,state=free,netload=1345146873471,gres=,loadave=0.08,ncpus=4,physmem=17040092kb,availmem=23485296kb,totmem=29739432kb,idletime=459327,nusers=5,nsessions=115,sessions=361 363 365 367 369 371 373 375 377 379 381 383 385 387 389 391 393 395 397 399 401 407 409 413 422 424 426 428 430 432 434 436 438 440 442 444 446 448 450 452 454 456 460 462 466 471 474 476 479 481 483 485 487 489 491 493 495 497 499 501 503 505 507 518 520 522 527 529 531 533 535 537 539 546 548 550 552 554 556 558 560 562 564 567 578 585 587 589 660 662 956 960 1637 1648 1657 1863 1891 5763 5839 5875 13067 18926 18986 19028 24492 24541 24588 24639 24684 24740 24787 29226 29631 30517 30521,uname=Linux naboo.linnbenton.edu 2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011 x86_64,opsys=linux
     gpus = 0

From gus at ldeo.columbia.edu Tue Sep 6 08:25:43 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 06 Sep 2011 10:25:43 -0400
Subject: [torqueusers] Help! One Puzzle At a Time...
In-Reply-To: <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com>
References: <1314572743.54682.YahooMailNeo@web110608.mail.gq1.yahoo.com> <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com>
Message-ID: <4E662D67.9040805@ldeo.columbia.edu>

Regarding the long time in Q state after H state:
if you are using the maui scheduler, this may be due to the default
defertime of 1 hour.
In this case, try setting it to less.
For instance, if you want it to be one minute, add this line:

DEFERTIME 00:01:00

to your ${MAUI}/maui.cfg file and restart maui.

See also:
http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php

Not sure if I understood it right, but for the
'resource temporarily unavailable' problem,
qnodes is reporting the 'naboo' node as 'down', hence unavailable.
It may need a reboot.

I hope this helps,
Gus Correa

sam oubari wrote:
> Hello,
>
> I am no expert at TORQUE and one key puzzle for us is why, on occasion,
> a waiting job moves from H to Q but not R when its scheduled time
> comes. When I attempt to force it with qrun I get:
>
> qrun: Resource temporarily unavailable MSG=job allocation request
> exceeds currently available cluster nodes, 1 requested, 0 available
> 3030.naboo.linnbenton.edu
> Below is the output of 'printserverdb' and 'qnodes' during the
> "freeze". To fix, I had to kill mom, restart it, then qrun the first Q job.
>
> Any hints would be greatly appreciated. Thx! Sam.
>
> PS. I've provided more details on 8/28/11.
>
>
> ------
> Sam Oubari, Manager of Systems & Application Programming
> Linn-Benton Community College -- Information Services
> 6500 Pacific Blvd SW, Room# CC 110E -- Albany OR 97321
> Tel: 541-917-4355/Fax: 541-917-4379
>
> ======
>
> # printserverdb
> ---------------------------------------------------
> numjobs: 26
> numque: 5
> jobidnumber: 3575
> savetm: 1314100391
> --attributes--
> scheduling = True
> max_running = 23
> total_jobs = 22
> state_count = Transit:0 Queued:0 Held:2 Waiting:17 Running:0 Exiting:0
> default_queue = sys_tst
> log_events = 511
> mail_from = adm
> query_other_jobs = False
> resources_assigned.nodect = 0
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> mom_job_sync = False
> pbs_version = 2.5.6
> keep_completed = 600
> allow_node_submit = True
> next_job_number = 1
> net_counter = 7 1 0
>
> # qnodes
> naboo
> state = down
> np = 40
> ntype = cluster
> status = rectime=1315288785,varattr=,jobs=3448.naboo.linnbenton.edu 3449.naboo.linnbenton.edu 3450.naboo.linnbenton.edu,state=free,netload=1345146873471,gres=,loadave=0.08,ncpus=4,physmem=17040092kb,availmem=23485296kb,totmem=29739432kb,idletime=459327,nusers=5,nsessions=115,sessions=361 363 365 367 369 371 373 375 377 379 381 383 385 387 389 391 393 395 397 399 401 407 409 413 422 424 426 428 430 432 434 436 438 440 442 444 446 448 450 452 454 456 460 462 466 471 474 476 479 481 483 485 487 489 491 493 495 497 499 501 503 505 507 518 520 522 527 529 531 533 535 537 539 546 548 550 552 554 556 558 560 562 564 567 578 585 587 589 660 662 956 960 1637 1648 1657 1863 1891 5763 5839 5875 13067 18926 18986 19028 24492 24541 24588 24639 24684 24740 24787 29226 29631 30517 30521,uname=Linux naboo.linnbenton.edu 2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011 x86_64,opsys=linux
> gpus = 0

From toth at fi.muni.cz Tue Sep 6 08:26:05 2011
From: toth at fi.muni.cz ("Mgr. Šimon Tóth")
Date: Tue, 06 Sep 2011 16:26:05 +0200
Subject: [torqueusers] Help! One Puzzle At a Time...
In-Reply-To: <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com>
References: <1314572743.54682.YahooMailNeo@web110608.mail.gq1.yahoo.com> <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com>
Message-ID: <4E662D7D.6050307@fi.muni.cz>

> naboo
> state = down

Well, there's your problem :-)

--
Mgr. Simon Toth

From marc.mendezbermond at gmail.com Sun Sep 4 07:08:17 2011
From: marc.mendezbermond at gmail.com (Marc Mendez-Bermond)
Date: Sun, 04 Sep 2011 15:08:17 +0200
Subject: [torqueusers] Torque over-limit submission accepted.
Message-ID: <4E637841.8070701@gmail.com>

Hi all,

I am fighting with a *new* installation of Torque coupled with Maui where 3 queues are defined, and when I try to submit a job to the "small" queue with more cores than its max allowed, the job is accepted.

For example, the small queue has 'resources_max.ncpus = 12', yet it accepts '-l nodes=2:ppn=12 -q small' requests ...

It looks like only the nodes value is considered, which is more or less confirmed if I try the following: '-l nodes=14:ppn=12' is routed to the "medium" queue, defined as 'set queue medium resources_max.ncpus = 64'.

Versions are:
- torque-2.5.7-1.el5.1 (EPEL5 RPMs for RHEL/CENTOS 5)
- maui-3.3-4.el5 (https://svnweb.cern.ch/trac/maui)

The configuration is detailed below. I think Maui is not the cause, as using pbs_sched leads to the same issue.

Any help appreciated!

Regards,
M.
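
To make the arithmetic in the report above concrete, here is a small Python model. It is only a sketch of the poster's hypothesis (that the node count is compared against the queue limit while the total core count, nodes times ppn, is not); the function names are hypothetical and this is not TORQUE's actual admission code.

```python
def parse_nodes_spec(spec):
    """Parse a simple 'nodes=N:ppn=P' request into (nodect, total_cores)."""
    fields = dict(part.split("=") for part in spec.split(":"))
    nodect = int(fields["nodes"])
    ppn = int(fields.get("ppn", 1))
    return nodect, nodect * ppn


def fits_by_nodect_only(spec, max_nodect):
    """Hypothetical check: compare only the node count with the limit."""
    nodect, _ = parse_nodes_spec(spec)
    return nodect <= max_nodect


# Limit of the 'small' queue in the posted configuration
# (resources_max.nodect and resources_max.ncpus are both 12).
SMALL_MAX = 12

for spec in ("nodes=2:ppn=12", "nodes=14:ppn=12"):
    nodect, cores = parse_nodes_spec(spec)
    accepted = fits_by_nodect_only(spec, SMALL_MAX)
    print(spec, "-> nodes:", nodect, "cores:", cores, "fits small:", accepted)
```

Under that assumption a 24-core request still fits the 12-limit because only its 2 nodes are counted, while the 14-node request does not fit and moves on to the next queue, which matches the behaviour described.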
====== # # Create queues and set their attributes. # # # Create and define queue medium # create queue medium set queue medium queue_type = Execution set queue medium max_queuable = 100 set queue medium resources_max.ncpus = 64 set queue medium resources_max.nodect = 64 set queue medium resources_min.ncpus = 13 set queue medium resources_default.walltime = 48:00:00 set queue medium enabled = True set queue medium started = True # # Create and define queue large # create queue large set queue large queue_type = Execution set queue large max_queuable = 100 set queue large resources_max.ncpus = 168 set queue large resources_max.nodect = 168 set queue large resources_min.ncpus = 65 set queue large resources_default.walltime = 24:00:00 set queue large enabled = True set queue large started = True # # Create and define queue small # create queue small set queue small queue_type = Execution set queue small max_queuable = 100 set queue small resources_max.ncpus = 12 set queue small resources_max.nodect = 12 set queue small resources_default.walltime = 96:00:00 set queue small enabled = True set queue small started = True # # Create and define queue portalq # create queue portalq set queue portalq queue_type = Route set queue portalq route_destinations = small set queue portalq route_destinations += medium set queue portalq route_destinations += large set queue portalq enabled = True set queue portalq started = True # # Set server attributes. 
#
set server scheduling = True
set server acl_hosts = master.mycluster.org
set server managers = root at master.mycluster.org
set server operators = root at master.mycluster.org
set server default_queue = portalq
set server log_events = 511
set server mail_from = adm
set server resources_default.nodect = 1
set server resources_default.walltime = 00:15:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server queue_centric_limits = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 183

From m.meinke at aia.rwth-aachen.de Tue Sep 6 09:12:00 2011
From: m.meinke at aia.rwth-aachen.de (Matthias Meinke)
Date: Tue, 06 Sep 2011 17:12:00 +0200
Subject: [torqueusers] Torque over-limit submission accepted.
In-Reply-To: <4E637841.8070701@gmail.com>
References: <4E637841.8070701@gmail.com>
Message-ID: <201109061712.00994.m.meinke@aia.rwth-aachen.de>

Hi,

I have a similar problem. I have set up 3 queues in a similar way to the
email below, with torque-3.0.2 and maui-3.3.1.

My observation is that when jobs are submitted to a default routing queue,
maui does not select the correct queue if cores are requested with

-l procs=16

and

set queue x resources_min.ncpus=8
set queue x resources_max.ncpus=32

are specified for queue x on the pbs server.

It works, however, if

-l nodes=4:ppn=4

and

set queue x resources_min.nodect=2
set queue x resources_max.nodect=16

are specified.

I hope that somebody can give us a hint...

Matthias

On Sunday 04 September 2011 15:08:17 Marc Mendez-Bermond wrote:
> Hi all,
>
> I am fighting with a *new* installation of Torque coupled with Maui
> where 3 queues are defined and when I try to submit a job to the "small"
> queue with more cores than its max allowed, the job is accepted.
>
> For example, the 'small resources_max.ncpus = 12' and the queue accepts
> '-l nodes=2:ppn=12 -q small' requests ...
It looks like the nodes value > only is considered which is quite confirmed if I try the following : > '-l nodes=14:ppn=12' is being routed to the "medium" queue defined as > 'set queue medium resources_max.ncpus = 64'. > > Versions are : > - torque-2.5.7-1.el5.1 (EPEL5 RPMs for RHEL/CENTOS 5) > - maui-3.3-4.el5 (https://svnweb.cern.ch/trac/maui) > > Its configuration is detailed below and I think Maui is out of the cause > as using the pbs_sched will lead to the same issue. > > Any help appreciated ! > > Regards, > M. > > ====== > > # > # Create queues and set their attributes. > # > # > # Create and define queue medium > # > create queue medium > set queue medium queue_type = Execution > set queue medium max_queuable = 100 > set queue medium resources_max.ncpus = 64 > set queue medium resources_max.nodect = 64 > set queue medium resources_min.ncpus = 13 > set queue medium resources_default.walltime = 48:00:00 > set queue medium enabled = True > set queue medium started = True > # > # Create and define queue large > # > create queue large > set queue large queue_type = Execution > set queue large max_queuable = 100 > set queue large resources_max.ncpus = 168 > set queue large resources_max.nodect = 168 > set queue large resources_min.ncpus = 65 > set queue large resources_default.walltime = 24:00:00 > set queue large enabled = True > set queue large started = True > # > # Create and define queue small > # > create queue small > set queue small queue_type = Execution > set queue small max_queuable = 100 > set queue small resources_max.ncpus = 12 > set queue small resources_max.nodect = 12 > set queue small resources_default.walltime = 96:00:00 > set queue small enabled = True > set queue small started = True > # > # Create and define queue portalq > # > create queue portalq > set queue portalq queue_type = Route > set queue portalq route_destinations = small > set queue portalq route_destinations += medium > set queue portalq route_destinations += large > set queue 
portalq enabled = True
> set queue portalq started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = master.mycluster.org
> set server managers = root at master.mycluster.org
> set server operators = root at master.mycluster.org
> set server default_queue = portalq
> set server log_events = 511
> set server mail_from = adm
> set server resources_default.nodect = 1
> set server resources_default.walltime = 00:15:00
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server queue_centric_limits = True
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 183

--
----------------------------------------------------------------------------------
Dr.-Ing. Matthias Meinke
Chair of Fluid Mechanics and Institute of Aerodynamics
RWTH Aachen University
Wüllnerstraße 5a
D-52062 Aachen
Germany
Phone: +49-(0)241-80-95328
Fax: +49-(0)241-80-92257
E-mail: m.meinke at aia.rwth-aachen.de
Internet: www.aia.rwth-aachen.de

From siegert at sfu.ca Tue Sep 6 12:39:12 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Tue, 6 Sep 2011 11:39:12 -0700
Subject: [torqueusers] Torque over-limit submission accepted.
In-Reply-To: <4E6526CB.40303@gmail.com>
References: <4E6526CB.40303@gmail.com>
Message-ID: <20110906183912.GA30931@stikine.sfu.ca>

Hi Marc,

AFAIK ncpus is not a resource interpreted by torque; it is a relic from
ancient history (SMP machines, not clusters) and in my experience it is
usually interpreted like nodes=1:ppn=x by the scheduler (I do not know
maui). Thus, I doubt that ncpus will have any effect on how jobs that
request a nodes resource in the form nodes=x:ppn=y are routed.
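
The routing behaviour under discussion can be sketched in a few lines of Python. This is only an illustration, not TORQUE's actual implementation: it assumes that a routing queue tries its destinations in order and that a min/max limit is checked only for resources the job actually requests. The names `Queue`, `route` and `DESTINATIONS` are hypothetical; the limits are copied from the configuration quoted in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    """Hypothetical model of an execution queue with min/max resource limits."""
    name: str
    res_min: dict = field(default_factory=dict)
    res_max: dict = field(default_factory=dict)

    def accepts(self, requested: dict) -> bool:
        # Assumption: a limit is compared only if the job requested that
        # resource, so ncpus limits are never consulted for nodes=x:ppn=y jobs.
        for res, value in requested.items():
            if res in self.res_min and value < self.res_min[res]:
                return False
            if res in self.res_max and value > self.res_max[res]:
                return False
        return True


def route(requested: dict, destinations: list):
    """Return the name of the first destination that accepts the job, else None."""
    for queue in destinations:
        if queue.accepts(requested):
            return queue.name
    return None


# Limits taken from the qmgr output posted in this thread (portalq order).
DESTINATIONS = [
    Queue("small", res_max={"ncpus": 12, "nodect": 12}),
    Queue("medium", res_min={"ncpus": 13}, res_max={"ncpus": 64, "nodect": 64}),
    Queue("large", res_min={"ncpus": 65}, res_max={"ncpus": 168, "nodect": 168}),
]

if __name__ == "__main__":
    # -l nodes=2:ppn=12 sets nodect=2 and no ncpus at all.
    print(route({"nodect": 2}, DESTINATIONS))
    # -l nodes=14:ppn=12 sets nodect=14, which exceeds small's nodect limit.
    print(route({"nodect": 14}, DESTINATIONS))
```

Under these assumptions the 24-core request lands in small and the 14-node request in medium, which matches what was reported; a request that did carry ncpus would be checked against the ncpus limits instead.
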
E.g., the job with -l nodes=14:ppn=12 is routed to the medium queue
because of the

set queue small resources_max.nodect = 12

setting: the job requests 14 nodes. None of the resources_min.ncpus
and/or resources_max.ncpus settings come into play, since none of your
jobs request -l ncpus=x. And if they did, those requests would be
interpreted in a way that has little to do with the intention of the user.
For that reason I prevent the accidental usage of ncpus on my clusters
through

set server resources_max.ncpus = 0

which causes jobs with an ncpus specification to be rejected right away.
In short: do not use ncpus.

As far as I understand, you want to route your jobs depending on the
number of requested processors (cores). Since torque-2.5.6 you can use
the procct resource to configure this; torque determines procct from the
nodes and/or procs specification of the job: requests of the form
-l nodes=x:ppn=y and/or -l procs=z result in procct=x*y+z, e.g.,
-l nodes=2:ppn=12 results in procct=24. Thus, you probably want to set

set queue small resources_max.procct = 12
set queue medium resources_min.procct = 13
set queue medium resources_max.procct = 64
set queue large resources_min.procct = 65
set queue large resources_max.procct = 168

and remove all ncpus (and possibly nodect) specifications in qmgr.

Cheers,
Martin

--
Martin Siegert
Simon Fraser University

On Mon, Sep 05, 2011 at 09:45:15PM +0200, Marc Mendez-Bermond wrote:
> Hi all,
>
> -= Sorry if repost, but I haven't seen my message appear on the list =-
>
> I am fighting with a *new* installation of Torque coupled with Maui
> where 3 queues are defined and when I try to submit a job to the "small"
> queue with more cores than its max allowed, the job is accepted.
>
> For example, the 'small resources_max.ncpus = 12' and the queue accepts
> '-l nodes=2:ppn=12 -q small' requests ...
It looks like the nodes value > only is considered which is quite confirmed if I try the following : > '-l nodes=14:ppn=12' is being routed to the "medium" queue defined as > 'set queue medium resources_max.ncpus = 64'. > > Versions are : > - torque-2.5.7-1.el5.1 (EPEL5 RPMs for RHEL/CENTOS 5) > - maui-3.3-4.el5 (https://svnweb.cern.ch/trac/maui) > > Its configuration is detailed below and I think Maui is out of the cause > as using the pbs_sched will lead to the same issue. > > Any help appreciated ! > > Regards, > M. > > ====== > > # > # Create queues and set their attributes. > # > # > # Create and define queue medium > # > create queue medium > set queue medium queue_type = Execution > set queue medium max_queuable = 100 > set queue medium resources_max.ncpus = 64 > set queue medium resources_max.nodect = 64 > set queue medium resources_min.ncpus = 13 > set queue medium resources_default.walltime = 48:00:00 > set queue medium enabled = True > set queue medium started = True > # > # Create and define queue large > # > create queue large > set queue large queue_type = Execution > set queue large max_queuable = 100 > set queue large resources_max.ncpus = 168 > set queue large resources_max.nodect = 168 > set queue large resources_min.ncpus = 65 > set queue large resources_default.walltime = 24:00:00 > set queue large enabled = True > set queue large started = True > # > # Create and define queue small > # > create queue small > set queue small queue_type = Execution > set queue small max_queuable = 100 > set queue small resources_max.ncpus = 12 > set queue small resources_max.nodect = 12 > set queue small resources_default.walltime = 96:00:00 > set queue small enabled = True > set queue small started = True > # > # Create and define queue portalq > # > create queue portalq > set queue portalq queue_type = Route > set queue portalq route_destinations = small > set queue portalq route_destinations += medium > set queue portalq route_destinations += large > set queue 
portalq enabled = True > set queue portalq started = True > # > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = master.mycluster.org > set server managers = root at master.mycluster.org > set server operators = root at master.mycluster.org > set server default_queue = portalq > set server log_events = 511 > set server mail_from = adm > set server resources_default.nodect = 1 > set server resources_default.walltime = 00:15:00 > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server queue_centric_limits = True > set server mom_job_sync = True > set server keep_completed = 300 > set server next_job_number = 183 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Adrian.Sevcenco at cern.ch Tue Sep 6 13:53:09 2011 From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco) Date: Tue, 6 Sep 2011 22:53:09 +0300 Subject: [torqueusers] torque :: configuration problems on localhost Message-ID: <4E667A25.3090203@cern.ch> Hi! I try to install torque on a simple 8 core computer .. i have a simple queue and simple configuration .. the thing is that i keep receiving: set server managers = root at localhost qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host: localhost Any idea about the problem? Thanks! OS : Fedora 14, torque 2.4.11 configuration: create queue alice set queue alice queue_type = Execution set queue alice max_running = 8 set queue alice max_queuable = 2000 set queue alice resources_max.cput = 24:00:00 set queue alice resources_max.walltime = 48:00:00 set queue alice enabled = True set queue alice started = True # # Set server attributes. 
#

## DEFAULT QUEUE
set server default_queue = alice

# seting submittins hosts
set server scheduling = True
# set server acl_host_enable = True

set server submit_hosts = localhost
set server acl_hosts = localhost

set server managers = root at localhost
set server managers += asevcenc at localhost

set server operators = root at localhost
set server operators += asevcenc at localhost

set server allow_node_submit = True

set server log_events = 511
set server mail_from = torque at sev2.spacescience.ro
set server query_other_jobs = True
set server resources_default.walltime = 72:00:00
set server scheduler_iteration = 600
set server node_ping_rate = 60
set server node_check_rate = 150
set server tcp_timeout = 20
set server node_pack = True
set server poll_jobs = True

From gus at ldeo.columbia.edu Tue Sep 6 14:13:43 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 06 Sep 2011 16:13:43 -0400
Subject: [torqueusers] torque :: configuration problems on localhost
In-Reply-To: <4E667A25.3090203@cern.ch>
References: <4E667A25.3090203@cern.ch>
Message-ID: <4E667EF7.3060600@ldeo.columbia.edu>

Hi Adrian

For what it is worth, here we have the server FQDN
in ${TORQUE}/server_name. The acl and other Torque
configuration parameters point to that FQDN (not to localhost,
which is the loopback interface name in /etc/hosts, right?)

I hope this helps.
Gus Correa

Adrian Sevcenco wrote:
> Hi! I try to install torque on a simple 8 core computer ..
> i have a simple queue and simple configuration ..
> the thing is that i keep receiving:
>
> set server managers = root at localhost
> qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host:
> localhost
>
> Any idea about the problem?
> Thanks!
> > > OS : Fedora 14, torque 2.4.11 > > configuration: > > create queue alice > set queue alice queue_type = Execution > set queue alice max_running = 8 > set queue alice max_queuable = 2000 > set queue alice resources_max.cput = 24:00:00 > set queue alice resources_max.walltime = 48:00:00 > set queue alice enabled = True > set queue alice started = True > > # > # Set server attributes. > # > > ## DEFAULT QUEUE > set server default_queue = alice > > # seting submittins hosts > set server scheduling = True > # set server acl_host_enable = True > > set server submit_hosts = localhost > set server acl_hosts = localhost > > set server managers = root at localhost > set server managers += asevcenc at localhost > > set server operators = root at localhost > set server operators += asevcenc at localhost > > set server allow_node_submit = True > > set server log_events = 511 > set server mail_from = torque at sev2.spacescience.ro > set server query_other_jobs = True > set server resources_default.walltime = 72:00:00 > set server scheduler_iteration = 600 > set server node_ping_rate = 60 > set server node_check_rate = 150 > set server tcp_timeout = 20 > set server node_pack = True > set server poll_jobs = True > > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Adrian.Sevcenco at cern.ch Tue Sep 6 14:27:01 2011 From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco) Date: Tue, 6 Sep 2011 23:27:01 +0300 Subject: [torqueusers] torque :: configuration problems on localhost In-Reply-To: <4E667EF7.3060600@ldeo.columbia.edu> References: <4E667A25.3090203@cern.ch> <4E667EF7.3060600@ldeo.columbia.edu> Message-ID: <4E668215.1040609@cern.ch> On 09/06/11 23:13, Gus Correa wrote: > Hi Adrian Hi! > For what it is worth, here we have the server FQDN name > in ${TORQUE}/sever_name. 
The acl and other Torque
> configuration parameters point to that FQDN (not to localhost,
> which is the loopback interface name in /etc/hosts, right?)

Yes, you are right, but I am just trying to make it work as localhost,
and I actually sort of did.

The problem now is that my jobs sit waiting in the queue (Q state), and
I have to use qrun to start them.

Thanks!
Adrian

From gus at ldeo.columbia.edu Tue Sep 6 14:38:33 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 06 Sep 2011 16:38:33 -0400
Subject: [torqueusers] torque :: configuration problems on localhost
In-Reply-To: <4E667EF7.3060600@ldeo.columbia.edu>
References: <4E667A25.3090203@cern.ch> <4E667EF7.3060600@ldeo.columbia.edu>
Message-ID: <4E6684C9.8040500@ldeo.columbia.edu>

PS - You may also need a ${TORQUE}/server_priv/nodes file,
if not yet there:

http://www.adaptivecomputing.com/resources/docs/torque/1.5nodeconfig.php

Maybe also a ${TORQUE}/mom_priv/config file with a line like this:

$pbsserver 'your pbs_server name'

Gus Correa wrote:
> Hi Adrian
>
> For what it is worth, here we have the server FQDN name
> in ${TORQUE}/server_name. The acl and other Torque
> configuration parameters point to that FQDN (not to localhost,
> which is the loopback interface name in /etc/hosts, right?)
>
> I hope this helps.
> Gus Correa
>
> Adrian Sevcenco wrote:
>> Hi! I try to install torque on a simple 8 core computer ..
>> i have a simple queue and simple configuration ..
>> the thing is that i keep receiving:
>>
>> set server managers = root at localhost
>> qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host:
>> localhost
>>
>> Any idea about the problem?
>> Thanks!
>> >> >> OS : Fedora 14, torque 2.4.11 >> >> configuration: >> >> create queue alice >> set queue alice queue_type = Execution >> set queue alice max_running = 8 >> set queue alice max_queuable = 2000 >> set queue alice resources_max.cput = 24:00:00 >> set queue alice resources_max.walltime = 48:00:00 >> set queue alice enabled = True >> set queue alice started = True >> >> # >> # Set server attributes. >> # >> >> ## DEFAULT QUEUE >> set server default_queue = alice >> >> # seting submittins hosts >> set server scheduling = True >> # set server acl_host_enable = True >> >> set server submit_hosts = localhost >> set server acl_hosts = localhost >> >> set server managers = root at localhost >> set server managers += asevcenc at localhost >> >> set server operators = root at localhost >> set server operators += asevcenc at localhost >> >> set server allow_node_submit = True >> >> set server log_events = 511 >> set server mail_from = torque at sev2.spacescience.ro >> set server query_other_jobs = True >> set server resources_default.walltime = 72:00:00 >> set server scheduler_iteration = 600 >> set server node_ping_rate = 60 >> set server node_check_rate = 150 >> set server tcp_timeout = 20 >> set server node_pack = True >> set server poll_jobs = True >> >> >> >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Tue Sep 6 14:41:39 2011 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 06 Sep 2011 16:41:39 -0400 Subject: [torqueusers] torque :: configuration problems on localhost In-Reply-To: <4E668215.1040609@cern.ch> References: <4E667A25.3090203@cern.ch> 
<4E667EF7.3060600@ldeo.columbia.edu> <4E668215.1040609@cern.ch> Message-ID: <4E668583.2010907@ldeo.columbia.edu> It may work with localhost for a single computer (not our case here). Did you start the scheduler (service pbs_sched start, or service maui start if you use maui)? Adrian Sevcenco wrote: > On 09/06/11 23:13, Gus Correa wrote: >> Hi Adrian > Hi! > >> For what it is worth, here we have the server FQDN name >> in ${TORQUE}/sever_name. The acl and other Torque >> configuration parameters point to that FQDN (not to localhost, >> which is the loopback interface name in /etc/hosts, right?) > yeah, you are right, but i just trying to make it work as localhost... > and actually i sort of did it ... > > the problem now is that my jobs start waiting in queue (Q state) .. and > i have to use qrun to start them .. > > Thanks! > Adrian > > > > ------------------------------------------------------------------------ > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From bino at coc.ufrj.br Tue Sep 6 15:15:57 2011 From: bino at coc.ufrj.br (Albino A. Aveleda) Date: Tue, 6 Sep 2011 18:15:57 -0300 (BRT) Subject: [torqueusers] enable-gui problem on SLES11-SP1 In-Reply-To: <241466084.4540.1315343716864.JavaMail.root@mailhost.coc.ufrj.br> Message-ID: <1871365140.4546.1315343757734.JavaMail.root@mailhost.coc.ufrj.br> Hi All, I am trying to compile torque-2.5.8 with --enable-gui option on SLES11-SP1 but I always receive the error below. The tcl and tk packages are installed. tcl-32bit-8.5.5-2.81 tcl-devel-8.5.5-2.81 tclx-8.4-470.22 tcl-8.5.5-2.81 tk-32bit-8.5.5-3.12 tk-8.5.5-3.12 tk-devel-8.5.5-3.12 I have problem with Tk_Init. The error message is below. [... skip ...] configure: Starting Tcl/Tk configuration checking for Tcl configuration... found /usr/lib64/tclConfig.sh checking for existence of /usr/lib64/tclConfig.sh... 
loading checking for Tcl public headers... /usr/include checking for tclx configuration... none checking for Tk configuration... found /usr/lib64/tkConfig.sh checking for existence of /usr/lib64/tkConfig.sh... loading checking for Tk public headers... /usr/include checking for Tcl_Init... yes checking for Tk_Init... no configure: Your Tk install is broken. Disabling Tk support. checking whether to include the GUI-clients... configure: error: cannot build GUI without Tk library How do I fix this? Best regards, Bibo From gus at ldeo.columbia.edu Tue Sep 6 15:34:56 2011 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 06 Sep 2011 17:34:56 -0400 Subject: [torqueusers] enable-gui problem on SLES11-SP1 In-Reply-To: <1871365140.4546.1315343757734.JavaMail.root@mailhost.coc.ufrj.br> References: <1871365140.4546.1315343757734.JavaMail.root@mailhost.coc.ufrj.br> Message-ID: <4E669200.6020505@ldeo.columbia.edu> Hi Albino Do you need perhaps the 64-bit Tcl/Tk packages, instead of 32-bit? Since configure is looking for Tcl/Tk items in /usr/lib64, I guess your machine is 64-bit, right? Anyway, just a guess. I hope it helps, Gus Correa Albino A. Aveleda wrote: > Hi All, > > I am trying to compile torque-2.5.8 with --enable-gui option on SLES11-SP1 > but I always receive the error below. The tcl and tk packages are installed. > > tcl-32bit-8.5.5-2.81 > tcl-devel-8.5.5-2.81 > tclx-8.4-470.22 > tcl-8.5.5-2.81 > tk-32bit-8.5.5-3.12 > tk-8.5.5-3.12 > tk-devel-8.5.5-3.12 > > I have problem with Tk_Init. The error message is below. > [... skip ...] > configure: Starting Tcl/Tk configuration > checking for Tcl configuration... found /usr/lib64/tclConfig.sh > checking for existence of /usr/lib64/tclConfig.sh... loading > checking for Tcl public headers... /usr/include > checking for tclx configuration... none > checking for Tk configuration... found /usr/lib64/tkConfig.sh > checking for existence of /usr/lib64/tkConfig.sh... loading > checking for Tk public headers... 
/usr/include > checking for Tcl_Init... yes > checking for Tk_Init... no > configure: Your Tk install is broken. Disabling Tk support. > checking whether to include the GUI-clients... configure: error: cannot build GUI without Tk library > > How do I fix this? > > Best regards, > Bibo > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jkusznir at gmail.com Tue Sep 6 18:11:27 2011 From: jkusznir at gmail.com (Jim Kusznir) Date: Tue, 6 Sep 2011 17:11:27 -0700 Subject: [torqueusers] torque problem: submitting jobs from nodes Message-ID: Hi All: I've got a user who's trying to have his jobs checkpoint and re-queue themselves at the end of their runtime so as to allow it to run with shorter walltime limits (and thus help balance cluster usage and fair share, etc). Of course, for this to work, he needs to be able to submit jobs (qsub) from the comptute nodes. I figured this should be no big deal, and check my qmgr settings: Qmgr: print server # # Create queues and set their attributes. # # # Create and define queue default # create queue default set queue default queue_type = Execution set queue default resources_max.walltime = 24:00:00 set queue default resources_default.nodes = 1 set queue default resources_default.walltime = 01:00:00 set queue default enabled = True set queue default started = True # # Create and define queue long # create queue long set queue long queue_type = Execution set queue long enabled = True set queue long started = True # # Set server attributes. 
# set server scheduling = True set server acl_host_enable = False set server acl_user_enable = False set server managers = kusznir at aeolus.wsu.edu set server managers += maui at aeolus.wsu.edu set server managers += root at aeolus.wsu.edu set server default_queue = default set server log_events = 511 set server mail_from = adm set server query_other_jobs = True set server resources_available.nodect = 288 set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server next_job_number = 304175 Unfortunately, when one tries to submit a job from a compute node, one gets: [kusznir at compute-0-20 ~]$ qsub -I -l nodes=1 qsub: Bad UID for job execution MSG=ruserok failed validating kusznir/kusznir from compute-0-20.local What's going on here? As far as I can read, all the settings are set to allow this to work. What's wrong? Thanks! --Jim From jpeltier at sfu.ca Tue Sep 6 18:53:33 2011 From: jpeltier at sfu.ca (James A. Peltier) Date: Tue, 6 Sep 2011 17:53:33 -0700 (PDT) Subject: [torqueusers] torque problem: submitting jobs from nodes In-Reply-To: Message-ID: <451386732.78516.1315356813385.JavaMail.root@jaguar10.sfu.ca> ----- Original Message ----- | Hi All: | | I've got a user who's trying to have his jobs checkpoint and re-queue | themselves at the end of their runtime so as to allow it to run with | shorter walltime limits (and thus help balance cluster usage and fair | share, etc). Of course, for this to work, he needs to be able to | submit jobs (qsub) from the comptute nodes. I figured this should be | no big deal, and check my qmgr settings: | | Qmgr: print server | # | # Create queues and set their attributes. 
| # | # | # Create and define queue default | # | create queue default | set queue default queue_type = Execution | set queue default resources_max.walltime = 24:00:00 | set queue default resources_default.nodes = 1 | set queue default resources_default.walltime = 01:00:00 | set queue default enabled = True | set queue default started = True | # | # Create and define queue long | # | create queue long | set queue long queue_type = Execution | set queue long enabled = True | set queue long started = True | # | # Set server attributes. | # | set server scheduling = True | set server acl_host_enable = False | set server acl_user_enable = False | set server managers = kusznir at aeolus.wsu.edu | set server managers += maui at aeolus.wsu.edu | set server managers += root at aeolus.wsu.edu | set server default_queue = default | set server log_events = 511 | set server mail_from = adm | set server query_other_jobs = True | set server resources_available.nodect = 288 | set server scheduler_iteration = 600 | set server node_check_rate = 150 | set server tcp_timeout = 6 | set server next_job_number = 304175 | | | Unfortunately, when one tries to submit a job from a compute node, one | gets: | [kusznir at compute-0-20 ~]$ qsub -I -l nodes=1 | qsub: Bad UID for job execution MSG=ruserok failed validating | kusznir/kusznir from compute-0-20.local | | What's going on here? As far as I can read, all the settings are set | to allow this to work. What's wrong? | | Thanks! | --Jim | _______________________________________________ | torqueusers mailing list | torqueusers at supercluster.org | http://www.supercluster.org/mailman/listinfo/torqueusers set server allow_node_submit = True -- James A. 
Peltier IT Services - Research Computing Group Simon Fraser University - Burnaby Campus Phone : 778-782-6573 Fax : 778-782-3045 E-Mail : jpeltier at sfu.ca Website : http://www.sfu.ca/itservices http://blogs.sfu.ca/people/jpeltier I will do the best I can with the talent I have From soubari at yahoo.com Tue Sep 6 21:50:11 2011 From: soubari at yahoo.com (sam oubari) Date: Tue, 6 Sep 2011 20:50:11 -0700 (PDT) Subject: [torqueusers] Help! One Puzzle At a Time... In-Reply-To: <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com> References: <1314572743.54682.YahooMailNeo@web110608.mail.gq1.yahoo.com> <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com> Message-ID: <1315367411.33284.YahooMailNeo@web110605.mail.gq1.yahoo.com> Hi Gus, I am using pbs_sched and all is on one server. To clarify, on occasion, jobs stay in Q until I bounce MOM. I am pretty sure something is wrong with my only node. Sam. Gus Correa gus at ldeo.columbia.edu wrote on Tue Sep 6 08:25:43 MDT 2011: Regarding the long time in Q state after H state. If you are using the maui scheduler, this may be due to the default defertime of 1 hour. In this case, try setting it to less. For instance, if you want it to be one minute, add this line: DEFERTIME 00:01:00 to your ${MAUI}/maui.cfg file and restart maui. See also: http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php Not sure if I understood it right, but for the 'resource temporarily unavailable' problem, qnodes is reporting the 'naboo' node as 'down', hence unavailable. It may need a reboot. I hope this helps, Gus Correa -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110906/e1b53533/attachment.html From Eirikur.Hjartarson at decode.is Wed Sep 7 02:22:31 2011 From: Eirikur.Hjartarson at decode.is (=?iso-8859-1?Q?Eir=EDkur_Hjartarson?=) Date: Wed, 7 Sep 2011 08:22:31 +0000 Subject: [torqueusers] Two problems with a routing queue Message-ID: <4C3D9F9382FE07458AF8E79B234B4BD8AD64FD@smbx.decode.is> Hi, In order to limit the number of jobs that maui considers for scheduling, we have a routing queue setup, # # Create and define queue exec # create queue exec set queue exec queue_type = Route set queue exec route_destinations = real_exec set queue exec route_held_jobs = False set queue exec enabled = True set queue exec started = True # # Create and define queue real_exec # create queue real_exec set queue real_exec queue_type = Execution set queue real_exec max_user_queuable = 800 set queue real_exec from_route_only = True set queue real_exec resources_default.nodes = 1 set queue real_exec enabled = True set queue real_exec started = True (800 is a bit higher than the number of CPUs in the cluster) There are two problems that we have experienced with this setup. 1. A job (id: 28379062), that is still on the "exec" queue and depends on another job (id: 28379059) that finishes *before* the job (id: 28379062) is put on the "real_exec" queue will generate the following error mail, when it (id: 28379062) is transferred to the "real_exec" queue. --- PBS Job Id: 28379062.lpbs2.decode.is Job Name: bambino_22892 Aborted by PBS Server Dependency request for job rejected by 28379059.lpbs2.decode.is Unknown Job Id Job held for unknown job dep, use 'qrls' to release --- Is there any way to solve this problem, other than setting the keep_completed attribute to some non-zero value? The problem with the keep_completed attribute is that we (think we) have to set it to a big value, say, one day. 2. 
The "real_exec" queue may get filled up with jobs that all depend on a job that is still on the "exec" queue. It seems possible to me that the route_held_jobs attribute only applies to user holds. If that is correct, would it be possible to let it also apply to system holds? Regards, -- Eiríkur Hjartarson From gus at ldeo.columbia.edu Wed Sep 7 07:16:03 2011 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 07 Sep 2011 09:16:03 -0400 Subject: [torqueusers] Help! One Puzzle At a Time... In-Reply-To: <1315367411.33284.YahooMailNeo@web110605.mail.gq1.yahoo.com> References: <1314572743.54682.YahooMailNeo@web110608.mail.gq1.yahoo.com> <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com> <1315367411.33284.YahooMailNeo@web110605.mail.gq1.yahoo.com> Message-ID: <4E676E93.9060907@ldeo.columbia.edu> Hi Sam I added your original message below, so that other people can read it. Do you have a ${TORQUE}/mom_priv/config file, pointing to your pbs_server, probably: $pbsserver naboo [Assuming naboo is the server name in ${TORQUE}/server_name.] Did you restart pbs_server after you modified ${TORQUE}/server_priv/nodes, etc? (service pbs_server restart) Anything in your /var/log/messages file telling why pbs_mom dies? I hope this helps, Gus sam oubari wrote: > Hi Gus, > > I am using pbs_sched and all is on one server. To clarify, on > occasion, jobs stay in Q until I bounce MOM. I am pretty sure > something is wrong with my only node. Sam. > ** > *Gus Correa* gus at ldeo.columbia.edu > wrote > on /Tue Sep 6 08:25:43 MDT 2011/ : > > Regarding the long time in Q state after H state. > If you are using the maui scheduler, this may be due to the default > defertime of 1 hour. > In this case, try setting it to less. > For instance, if you want it to be one minute, add this line: > DEFERTIME 00:01:00 > to your ${MAUI}/maui.cfg file and restart maui. 
> > See also: > http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php > > Not sure if I understood it right, but > for the 'resource temporarily unavailable' problem, > qnodes is reporting the 'naboo' node as 'down', hence unavailable. > It may need a reboot. > > I hope this helps, > Gus Correa > sam oubari wrote: > Hello, > I hope someone can help: > 1) When we were running 2.4.11, every few weeks, pbs_sched would die. We > upgraded using a fresh install to 2.5.6 about two months ago, and we > configured it like we did with 2.4.11 using: > ./configure --enable-docs --disable-dependency-tracking > --disable-libtool-lock --with-scp > Now, almost every Sunday at 11pm (we do fire up a few jobs but we do > that every AM and PM), mom dies or defunct, eg: > $ ps -ef|grep pbs > root 6704 1 0 Jul13 ? 00:02:18 /usr/local/sbin/pbs_server > root 6910 1 0 Jul13 ? 00:00:56 /usr/local/sbin/pbs_sched > rpt_devl 8871 10997 0 Jul31 ? 00:00:00 [pbs_mom] > root 10997 1 4 Jul25 ? 07:48:14 /usr/local/sbin/pbs_mom > Usually, at that time, there are 4 jobs waiting to execute to perform > clean up on 4 DBs, and that seems to get MOM stuck. > See Dump-1 at the bottom. Our current config is shown below Dump-1 as > Dump-2. > 2) Both 2.4.x and 2.5.x occasionally don't schedule a waiting job, if I > recall, it goes from W to Q but not R. When that happens, I force it > with QRUN. > 3) I manually had created server_priv/nodes (I just made np=40, it used > to be 20): > # echo "naboo np=40">/var/spool/torque/server_priv/nodes > But I still cannot verify within qmgr: > # qmgr > list nodes > No Active Nodes, nothing done. > 4) I manually configured by starting with "pbs_server -t create", but > now I am missing $TORQUE_HOME/mom_priv/config. For my simple install, is > it required? > 5) Speaking of qmgr, most the time when I enter it quits without an > output after I issue my 1st command. I re-enter immediately, then it > accepts all my commands with no problem. 
This has been true for 2.4.x > and 2.5.x. > Any ideas? If things don't improve, I am planning to revert back to > 2.4.x. Thx! Sam. > ------ > Sam Oubari, Manager of Systems & Application Programming > Linn-Benton Community College -- Information Services > 6500 Pacific Blvd SW, Room# CC 110E -- Albany OR 97321 > Tel: 541-917-4355/Fax: 541-917-4379 > *********** Dump-1: > 07/31/2011 22:55:37;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = > 2.5.6, loglevel = 0 > 07/31/2011 23:00:00;0001; pbs_mom;Job;TMomFinalizeJob3;job > 6780.naboo.linnbenton.edu started, pid = 8294 > 07/31/2011 23:00:08;0080; > pbs_mom;Job;6780.naboo.linnbenton.edu;scan_for_terminated: job > 6780.naboo.linnbenton.edu task 1 terminated, sid=8294 > 07/31/2011 23:00:08;0008; pbs_mom;Job;6780.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:08;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:08;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:08;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:08;0080; pbs_mom;Job;6780.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:08;0001; pbs_mom;Job;TMomFinalizeJob3;job > 6781.naboo.linnbenton.edu started, pid = 8439 > 07/31/2011 23:00:10;0080; > pbs_mom;Job;6781.naboo.linnbenton.edu;scan_for_terminated: job > 6781.naboo.linnbenton.edu task 1 terminated, sid=8439 > 07/31/2011 23:00:10;0008; pbs_mom;Job;6781.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:10;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:10;0080; pbs_mom;Job;6781.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job > 7141.naboo.linnbenton.edu 
started, pid = 8579 > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job > 7143.naboo.linnbenton.edu started, pid = 8582 > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job > 7146.naboo.linnbenton.edu started, pid = 8589 > 07/31/2011 23:00:10;0001; pbs_mom;Job;TMomFinalizeJob3;job > 7147.naboo.linnbenton.edu started, pid = 8603 > 07/31/2011 23:00:10;0080; > pbs_mom;Job;7141.naboo.linnbenton.edu;scan_for_terminated: job > 7141.naboo.linnbenton.edu task 1 terminated, sid=8579 > 07/31/2011 23:00:10;0008; pbs_mom;Job;7141.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:10;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:10;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:10;0080; pbs_mom;Job;7141.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:11;0080; > pbs_mom;Job;7143.naboo.linnbenton.edu;scan_for_terminated: job > 7143.naboo.linnbenton.edu task 1 terminated, sid=8582 > 07/31/2011 23:00:11;0080; > pbs_mom;Job;7146.naboo.linnbenton.edu;scan_for_terminated: job > 7146.naboo.linnbenton.edu task 1 terminated, sid=8589 > 07/31/2011 23:00:11;0080; > pbs_mom;Job;7147.naboo.linnbenton.edu;scan_for_terminated: job > 7147.naboo.linnbenton.edu task 1 terminated, sid=8603 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7143.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:11;0008; pbs_mom;Job;7146.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:11;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:11;0080; > 
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:11;0080; pbs_mom;Job;7143.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:11;0080; pbs_mom;Job;7146.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:11;0008; pbs_mom;Job;7147.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:11;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:11;0080; > pbs_mom;Job;7115.naboo.linnbenton.edu;scan_for_terminated: job > 7115.naboo.linnbenton.edu task 1 terminated, sid=28947 > 07/31/2011 23:00:11;0080; pbs_mom;Job;7147.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:11;0008; pbs_mom;Job;7115.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:11;0080; > pbs_mom;Job;7116.naboo.linnbenton.edu;scan_for_terminated: job > 7116.naboo.linnbenton.edu task 1 terminated, sid=28966 > 07/31/2011 23:00:11;0080; > pbs_mom;Job;7118.naboo.linnbenton.edu;scan_for_terminated: job > 7118.naboo.linnbenton.edu task 1 terminated, sid=29020 > 07/31/2011 23:00:11;0080; > pbs_mom;Job;7117.naboo.linnbenton.edu;scan_for_terminated: job > 7117.naboo.linnbenton.edu task 1 terminated, sid=29083 > 07/31/2011 23:00:11;0008; pbs_mom;Job;7116.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:11;0008; pbs_mom;Job;7118.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:11;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:11;0080; 
pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:11;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:11;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:11;0080; pbs_mom;Job;7116.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:11;0080; pbs_mom;Job;7115.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:11;0080; pbs_mom;Job;7118.naboo.linnbenton.edu;obit > sent to server > 07/31/2011 23:00:11;0008; pbs_mom;Job;7117.naboo.linnbenton.edu;job was > terminated > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 07/31/2011 23:00:11;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top > of while loop > 07/31/2011 23:00:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 07/31/2011 23:00:11;0080; pbs_mom;Job;7117.naboo.linnbenton.edu;obit > sent to server > > ? 
> *********** Dump-2: > # printserverdb > --------------------------------------------------- > numjobs: 50 > numque: 5 > jobidnumber: 615 > savetm: 1314100391 > --attributes-- > scheduling = True > max_running = 23 > total_jobs = 22 > state_count = Transit:0 Queued:0 Held:2 Waiting:17 Running:0 Exiting:0 > default_queue = sys_tst > log_events = 511 > mail_from = adm > query_other_jobs = False > resources_assigned.nodect = 0 > scheduler_iteration = 600 > node_check_rate = 150 > tcp_timeout = 6 > mom_job_sync = False > pbs_version = 2.5.6 > keep_completed = 600 > allow_node_submit = True > next_job_number = 1 > net_counter = 7 1 0 > # qnodes > naboo > state = free > np = 40 > ntype = cluster > jobs = 0/2.naboo.linnbenton.edu, 2/461.naboo.linnbenton.edu, > 3/462.naboo.linnbenton.edu, 4/463.naboo.linnbenton.edu, > 5/466.naboo.linnbenton.edu, 6/467.naboo.linnbenton.edu, > 7/468.naboo.linnbenton.edu > status = rectime=1314203130,varattr=,jobs=2.naboo.linnbenton.edu > 461.naboo.linnbenton.edu 462.naboo.linnbenton.edu > 463.naboo.linnbenton.edu 466.naboo.linnbenton.edu > 467.naboo.linnbenton.edu > 468.naboo.linnbenton.edu,state=free,netload=1125010513873,gres=,loadave=1.37,ncpus=4,physmem=17040092kb,availmem=23052344kb,totmem=29739432kb,idletime=71635,nusers=11,nsessions=163,sessions=612 > 614 616 618 620 622 624 626 628 630 632 634 636 638 640 642 644 646 648 > 650 652 659 661 665 678 680 682 684 686 688 690 692 694 696 698 700 702 > 704 706 708 710 712 716 720 725 727 729 731 733 735 737 739 741 743 745 > 747 749 751 753 755 757 759 769 771 776 778 780 782 784 786 788 790 792 > 794 796 798 800 802 804 806 808 810 820 822 829 899 901 903 905 909 5763 > 911 1547 1648 2569 2635 2691 2753 2814 2878 2932 5839 5875 7864 7985 > 7987 7989 7993 7995 7997 7999 8001 8003 8005 8007 8009 8011 8013 8015 > 8017 8019 8021 8023 8025 8027 8029 8129 8131 8163 8165 8167 10447 10505 > 10562 10618 10706 11904 11966 11981 12022 12080 12433 13937 14899 15031 > 15282 16152 17383 22451 23720 
31671 31673 31675 31677 31683 31708 31712 > 31979 32003 32088 32091 32102 32116,uname=Linux naboo.linnbenton.edu > 2.6.18-238.12.1.0.1.el5 #1 SMP Tue May 31 14:51:07 EDT 2011 > x86_64,opsys=linux > gpus = 0 > # qmgr -c "l s" > Server naboo.linnbenton.edu > server_state = Active > scheduling = True > max_running = 23 > total_jobs = 41 > state_count = Transit:0 Queued:0 Held:3 Waiting:21 Running:7 Exiting:0 > acl_hosts = naboo.linnbenton.edu,naboo > managers = _usr1_ > operators = > _usr1,_ usr2, .... > default_queue = sys_tst > log_events = 511 > mail_from = adm > query_other_jobs = False > resources_assigned.nodect = 7 > scheduler_iteration = 600 > node_check_rate = 150 > tcp_timeout = 6 > mom_job_sync = False > pbs_version = 2.5.6 > keep_completed = 600 > allow_node_submit = True > next_job_number = 625 > net_counter = 6 2 1 > # qmgr -c "l q sys_ban" # This is our main queue > Queue sys_ban > queue_type = Execution > max_queuable = 300 > total_jobs = 21 > state_count = Transit:0 Queued:0 Held:0 Waiting:15 Running:0 Exiting:0 > max_running = 1 > resources_default.nodes = 1 > resources_default.walltime = 168:00:00 > mtime = Sat Aug 20 05:25:20 2011 > resources_assigned.nodect = 0 > enabled = True > started = True > # qstat -q > server: naboo.linnbenton.edu > Queue Memory CPU Time Walltime Node Run Que Lm State > ---------------- ------ -------- -------- ---- --- --- -- ----- > sys_ban -- -- -- -- 0 15 1 E R > sys_srv -- -- -- -- 7 8 10 E R > sys_tst -- -- -- -- 0 1 1 E R > sys_ban_quick -- -- -- -- 0 0 1 E R > sys_queue -- -- -- -- 0 0 2 E R > ----- ----- > 7 24 > From gus at ldeo.columbia.edu Wed Sep 7 09:39:55 2011 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 07 Sep 2011 11:39:55 -0400 Subject: [torqueusers] Help! One Puzzle At a Time... 
In-Reply-To: <1315406460.16926.YahooMailNeo@web110609.mail.gq1.yahoo.com> References: <1314572743.54682.YahooMailNeo@web110608.mail.gq1.yahoo.com> <1315318065.97191.YahooMailNeo@web110615.mail.gq1.yahoo.com> <4E662D7D.6050307@fi.muni.cz> <1315406460.16926.YahooMailNeo@web110609.mail.gq1.yahoo.com> Message-ID: <4E67904B.9020208@ldeo.columbia.edu> Hi Sam One of your error messages complains that ${TORQUE}/mom_priv/config doesn't exist. Why not create it, then restart? Can you resolve the short name 'naboo'? Is it via DNS, /etc/hosts, something else? Does the short name 'naboo' have an IP address? (ping -c 1 naboo) Anyway, since this is a single machine, have you tried to use 'localhost' in the ${TORQUE}/server_name and ${TORQUE}/server_priv/nodes files (and in ${TORQUE}/mom_priv/config if you want to create it) instead of naboo? (Then restart everything.) I wonder if this would be a simpler approach for a single machine. I hope this helps, Gus Correa sam oubari wrote: > Hi Gus, > > Again, thank you for helping. I can't use my work email to post, not > sure why, and Yahoo does not handle listservs well. So please post your > response instead. > > 1) I don't have ${TORQUE}/mom_priv/config. Do I simply create one like > shown below? > 2) Yes, I restarted. I am still puzzled why node settings don't show in > qmgr when I issue "l n" and why they don't seem to stick when I activate? > 3) I don't have PBS defined as a service, I start/stop from command line. > 4) New clues: > In /var/log/messages, when dying (usually once a week): > Sep 5 23:01:18 naboo pbs_server: LOG_WARNING::Expired credential > (15022) in send_job, child timed-out attempting to start job > 3451.naboo.linnbenton.edu > Sep 5 23:01:18 naboo pbs_server: LOG_ERROR::stream_eof, connection to > naboo is bad, remote service may be down, message may be corrupt, or > connection may have been dropped remotely (No error). 
setting node > state to down > When restarting MOM: > Sep 6 06:26:04 naboo pbs_mom: LOG_ERROR::No such file or directory (2) > in read_config, fstat: config > In sched_logs/20110905: > 09/05/2011 23:10:00;0040; pbs_sched;Job;3451.naboo.linnbenton.edu;Not > enough of the right type of nodes available > 09/05/2011 23:13:00;0040; > pbs_sched;Job;3571.naboo.linnbenton.edu;Draining system to allow > 3451.naboo.linnbenton.edu to run > 09/05/2011 23:30:00;0040; > pbs_sched;Job;3453.naboo.linnbenton.edu;Draining system to allow > 3451.naboo.linnbenton.edu to run > Sam. > > Gus Correa gus at ldeo.columbia.edu > > /Wed Sep 7 07:16:03 MDT 2011/ > Hi Sam I added your original message below, so that other people can > read it. Do you have a ${TORQUE}/mom_priv/config file, pointing to > your pbs_server, probably: $pbsserver naboo [Assuming naboo is the > server name in ${TORQUE}/server_name.] Did you restart pbs_server > after you modified ${TORQUE}/server_priv/nodes, etc? (service > pbs_server restart) Anything on your /var/log/messages file telling > why pbs_mom dies? I hope this helps, Gus sam oubari wrote: > > Hi Gus, > > > > I am using pbs_sched and all is on one server. To clarify, on > > occasion, jobs stay in Q until I bounce MOM. I am pretty sure > > something is wrong with my only node. Sam. > > ** > > *Gus Correa* gus at ldeo.columbia.edu > > wrote > > on /Tue Sep 6 08:25:43 MDT 2011/ : > > > > Regarding the long time in Q state after H state. > > If you are using the maui scheduler, this may be due to the default > > defertime of 1 hour. > > In this case, try setting it to less. > > For instance, if you want it to be one minute, add this line: > > DEFERTIME 00:01:00 > > to your ${MAUI}/maui.cfg file and restart maui. 
> > > > See also: > > http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php > > > > Not sure if I understood it right, but > > for the 'resource temporarily unavailable' problem, > > qnodes is reporting the 'naboo' node as 'down', hence unavailable. > > It may need a reboot. > > > > I hope this helps, > > Gus Correa _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From ataufer at adaptivecomputing.com Wed Sep 7 12:48:41 2011 From: ataufer at adaptivecomputing.com (Al Taufer) Date: Wed, 07 Sep 2011 12:48:41 -0600 (MDT) Subject: [torqueusers] NVIDIA GPUs version error In-Reply-To: Message-ID: 
<7aaf3c88-7e62-4953-9a07-3c73f6edc301@mail> We have only tested Torque using the 260 and 270 Nvidia Drivers so driver versions greater than 270 are not yet recognized. I am in the process of testing with the 275 and 280 drivers and hope to update Torque this week so it will accept any driver version greater than 260. There were major changes between the 260 and 270 driver versions and we should be okay with future driver releases as long as the Nvidia interface does not change. Al ----- Original Message ----- > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi all, > > I'm getting errors in my syslog from our gpu nodes pbs_moms: > > Aug 22 15:55:09 blugpu07 pbs_mom: LOG_ERROR::a system error occured > (15205) in generate_server_gpustatus_smi, Unknown Nvidia driver > version > > Here is the snipped output of pbsnodes blugpu07: > > gpu_status = > gpu[1]=gpu_id=0:15:0;,gpu[0]=gpu_id=0:14:0;,driver_ver=275.09.07,timestamp=Mon > Aug 22 15:56:41 2011 > > > If I login to the node, and check the pbs_mom logfiles, I see the > following: > > 08/22/2011 15:57:24;0002; > pbs_mom;n/a;mom_server_all_update_gpustat;composing gpu status > update for server > 08/22/2011 15:57:24;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::gpus, gpus: > GPU cmd issued: nvidia-smi -a -x 2>&1 > 08/22/2011 15:57:26;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::a system > error occured (15205) in generate_server_gpustatus_smi, Unknown > Nvidia driver versio n > 08/22/2011 15:57:26;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::a system > error occured (15205) in generate_server_gpustatus_smi, Unknown > Nvidia driver versio n > 08/22/2011 15:57:26;0002; > pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: > sending to server "timestamp=Mon Aug 22 15:57:26 2011" > 08/22/2011 15:57:26;0002; > pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: > sending to server "driver_ver=275.09.07" > 08/22/2011 15:57:26;0002; > pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: > sending to server 
"gpuid=0:14:0" > 08/22/2011 15:57:26;0002; > pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: > sending to server "gpuid=0:15:0" > 08/22/2011 15:57:26;0002; > pbs_mom;n/a;mom_server_update_gpustat;status update successfully > sent to bhsn-int > > > Is this driver version we have not supported by torque? > > > > Environment: > - TORQUE-2.5.6 > - NVIDIA Driver Version : 275.09.07 > - kernel: 2.6.18-238.12.1.el5 > > - TORQUE client was build via: > This build was configured with: '''--prefix=/opt/torque/2.5.6' > '--exec-prefix=/opt/torque/2.5.6/x86_64' > '--with-server-home=/var/spool/pbs' '--enable-syslog' '--with-scp' > '--disable-rpp' '--disable-spool' '--with-pam' '--with-cpusets' > '--with-geometry-requests' '--disable-gui' '--enable-nvidia-gpus' > '--enable-docs' > > > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG/MacGPG2 v2.0.17 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQEcBAEBAgAGBQJOUrawAAoJENS19LGOpgqKwkoIAIQrY8rZn+J+vaSgnTElGxvu > KcMYlqkiBBZtix7YBCVMsHTv5PcOPT/4l1qHX4/7/P9ZW6Xc542LNKLJrd46FcLa > cmbkixUaGRJ5SDCVSyA6YzZZIBDHBjP3AMrIouDwjyOEhR3A9agI5yYPdFTRdcNQ > NoagT372lZnhVfPUYrVLM8oVIbS+KsZZGiYA4HShsbPUB/qqU/YqNroLlg7o8lVX > gHBY7C231TpC/YAJx1xZ5qjSSl1/mtzK8PuzqZ5mWBFtoXFvlzXFe+C0uqcCHLh2 > jjkGeRU09YCkHEuqJy+iQ/KDGgvAFSmyuDgWq3RPJX8c7xw+y7saDLjhH9vPdVg= > =zdfO > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From bino at coc.ufrj.br Thu Sep 8 05:17:55 2011 From: bino at coc.ufrj.br (Albino A. 
Aveleda) Date: Thu, 8 Sep 2011 08:17:55 -0300 (BRT) Subject: [torqueusers] enable-gui problem on SLES11-SP1 In-Reply-To: <1386354672.5491.1315480632784.JavaMail.root@mailhost.coc.ufrj.br> Message-ID: <2058708068.5493.1315480675517.JavaMail.root@mailhost.coc.ufrj.br> Hi Gus, Yes, my machine is 64bit. When I received this error I had only 64-bit packages. Then I installed the 32-bit too, but I had the same error. All configuration in tkConfig.sh file are 64-bit. I don?t know what happen. I don?t have this problem on RHEL and CentOS linux. Best regards, Bibo ----- Mensagem original ----- De: "Gus Correa" Para: "Torque Users Mailing List" Enviadas: Ter?a-feira, 6 de Setembro de 2011 18:34:56 Assunto: Re: [torqueusers] enable-gui problem on SLES11-SP1 Hi Albino Do you need perhaps the 64-bit Tcl/Tk packages, instead of 32-bit? Since configure is looking for Tcl/Tk items in /usr/lib64, I guess your machine is 64-bit, right? Anyway, just a guess. I hope it helps, Gus Correa Albino A. Aveleda wrote: > Hi All, > > I am trying to compile torque-2.5.8 with --enable-gui option on SLES11-SP1 > but I always receive the error below. The tcl and tk packages are installed. > > tcl-32bit-8.5.5-2.81 > tcl-devel-8.5.5-2.81 > tclx-8.4-470.22 > tcl-8.5.5-2.81 > tk-32bit-8.5.5-3.12 > tk-8.5.5-3.12 > tk-devel-8.5.5-3.12 > > I have problem with Tk_Init. The error message is below. > [... skip ...] > configure: Starting Tcl/Tk configuration > checking for Tcl configuration... found /usr/lib64/tclConfig.sh > checking for existence of /usr/lib64/tclConfig.sh... loading > checking for Tcl public headers... /usr/include > checking for tclx configuration... none > checking for Tk configuration... found /usr/lib64/tkConfig.sh > checking for existence of /usr/lib64/tkConfig.sh... loading > checking for Tk public headers... /usr/include > checking for Tcl_Init... yes > checking for Tk_Init... no > configure: Your Tk install is broken. Disabling Tk support. 
> checking whether to include the GUI-clients... configure: error: cannot build GUI without Tk library
>
> How do I fix this?
>
> Best regards,
> Bibo

From andre.gemuend at scai.fraunhofer.de Thu Sep 8 01:59:22 2011
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Thu, 08 Sep 2011 09:59:22 +0200 (CEST)
Subject: [torqueusers] File staging syntax
In-Reply-To: <8ea68c7c-da34-4071-add2-3389704e23cd@zimbra.scai.fraunhofer.de>
Message-ID: <9596036d-ff12-43a6-a8ce-ec19ca705051@zimbra.scai.fraunhofer.de>

Hello everyone,

Is it possible that the -W syntax changed again between 2.5.5 and 2.5.8? We were using 2.5.5 without problems, but since I upgraded to 2.5.8 yesterday, PBS scripts with more than one file per staging line failed with "illegal -W syntax". I had to change the scripts to use separate -W lines for every file. I didn't see this in the changelog, or maybe I just missed it?

Greetings
André

-- 
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From gabe at msi.umn.edu Thu Sep 8 08:32:45 2011
From: gabe at msi.umn.edu (Gabe Turner)
Date: Thu, 8 Sep 2011 09:32:45 -0500
Subject: [torqueusers] enable-gui problem on SLES11-SP1
In-Reply-To: <2058708068.5493.1315480675517.JavaMail.root@mailhost.coc.ufrj.br>
References: <1386354672.5491.1315480632784.JavaMail.root@mailhost.coc.ufrj.br> <2058708068.5493.1315480675517.JavaMail.root@mailhost.coc.ufrj.br>
Message-ID: <20110908143245.GC21637@blackice.msi.umn.edu>

On Thu, Sep 08, 2011 at 08:17:55AM -0300, Albino A. Aveleda wrote:
> Hi Gus,
>
> Yes, my machine is 64bit. When I received this error I had only 64-bit
> packages. Then I installed the 32-bit too, but I had the same error.
> All configuration in tkConfig.sh file are 64-bit.
>
> I don't know what happen. I don't have this problem on RHEL and CentOS
> linux.
> I don't seem to have this problem on SLE11 SP1. We don't use the GUI, so I never build it, but I just tried a 'configure --enable-gui' and it worked fine: -------- [~/torque-2.5.8] % ./configure --enable-gui . . . configure: Starting Tcl/Tk configuration checking for Tcl configuration... found /usr/lib64/tclConfig.sh checking for existence of /usr/lib64/tclConfig.sh... loading checking for Tcl public headers... /usr/include checking for tclx configuration... none checking for Tk configuration... found /usr/lib64/tkConfig.sh checking for existence of /usr/lib64/tkConfig.sh... loading checking for Tk public headers... /usr/include checking for Tcl_Init... yes checking for Tk_Init... yes checking whether to include the GUI-clients... yes checking for tclsh... /usr/bin/tclsh8.5 checking for wish... /usr/bin/wish8.5 checking checking for Tcl attribute seperator... . checking whether to enable tcl-qstat... yes configure: Finished Tcl/Tk configuration . . . Building components: server=yes mom=yes clients=yes gui=yes drmaa=no pam=no PBS Machine type: linux Remote copy: /usr/bin/scp -rpB PBS home: /var/spool/torque Default server: serverhost Unix Domain sockets: no Tcl: -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm Tk: -L/usr/lib64 -ltk8.5 -L/usr/lib64 -lX11 -lXss -lXext -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm Ready for 'make'. -------- I do notice that your packages are a bit out of date. These are the latest: % rpm -qa | egrep '^t(cl|k)-' tcl-8.5.5-2.81 tk-8.5.5-3.14.1 tcl-devel-8.5.5-2.81 tk-devel-8.5.5-3.14.1 Perhaps you are hitting a bug in the packages you have. Otherwise, I'd posit that something has gone awry on your system in in your environment. 
HTH, Gabe -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu From arnaubria at pic.es Thu Sep 8 08:40:23 2011 From: arnaubria at pic.es (Arnau Bria) Date: Thu, 8 Sep 2011 16:40:23 +0200 Subject: [torqueusers] exit_status=-10 when 271 expected Message-ID: <20110908164023.27a9788d@amarrosa.pic.es> Hi all, I'm testing torque-2.5.7-1.cri on new server. I've a queue defaults: resources_max.cput = 00:10:00 resources_max.walltime = 00:10:00 and I'm sending a jobs like: $ echo sleep 900 |qsub -q gshort_sl5 17727921.pbs01-test.pic.es that jobs lasts more than 10 minutes, so queues have kill it. It works "fine", but the exit_value returned is -10: Job: 17727921.pbs01-test.pic.es 09/08/2011 16:30:11 S enqueuing into gshort_sl5, state 1 hop 1 09/08/2011 16:30:11 S Job Queued at request of dteam001 at dcgftp01.pic.es, owner = dteam001 at dcgftp01.pic.es, job name = STDIN, queue = gshort_sl5 09/08/2011 16:30:11 A queue=gshort_sl5 09/08/2011 16:30:18 S Job Run at request of root at pbs01-test.pic.es 09/08/2011 16:30:18 A user=dteam001 group=dteam jobname=STDIN queue=gshort_sl5 ctime=1315492211 qtime=1315492211 etime=1315492211 start=1315492218 owner=dteam001 at dcgftp01.pic.es exec_host=tditaller025.pic.es/0 Resource_List.cput=00:10:00 Resource_List.neednodes=slc5_x64 Resource_List.walltime=00:10:00 09/08/2011 16:35:34 S Exit_status=-10 resources_used.cput=00:00:00 resources_used.mem=2864kb resources_used.vmem=191156kb resources_used.walltime=00:11:08 09/08/2011 16:35:34 A user=dteam001 group=dteam jobname=STDIN queue=gshort_sl5 ctime=1315492211 qtime=1315492211 etime=1315492211 start=1315492218 owner=dteam001 at dcgftp01.pic.es exec_host=tditaller025.pic.es/0 Resource_List.cput=00:10:00 Resource_List.neednodes=slc5_x64 Resource_List.walltime=00:10:00 session=29934 end=1315492534 Exit_status=-10 resources_used.cput=00:00:00 resources_used.mem=2864kb resources_used.vmem=191156kb 
resources_used.walltime=00:11:08 09/08/2011 16:35:35 S dequeuing from gshort_sl5, state COMPLETE But, isn't exit_status 271 expected? What's the meaning of that -10? TIA, Arnau From gus at ldeo.columbia.edu Thu Sep 8 09:00:46 2011 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 08 Sep 2011 11:00:46 -0400 Subject: [torqueusers] enable-gui problem on SLES11-SP1 In-Reply-To: <20110908143245.GC21637@blackice.msi.umn.edu> References: <1386354672.5491.1315480632784.JavaMail.root@mailhost.coc.ufrj.br> <2058708068.5493.1315480675517.JavaMail.root@mailhost.coc.ufrj.br> <20110908143245.GC21637@blackice.msi.umn.edu> Message-ID: <4E68D89E.4000809@ldeo.columbia.edu> Gabe Turner wrote: > On Thu, Sep 08, 2011 at 08:17:55AM -0300, Albino A. Aveleda wrote: >> Hi Gus, >> >> Yes, my machine is 64bit. When I received this error I had only 64-bit >> packages. Then I installed the 32-bit too, but I had the same error. >> All configuration in tkConfig.sh file are 64-bit. >> >> I don?t know what happen. I don?t have this problem on RHEL and CentOS >> linux. >> > > I don't seem to have this problem on SLE11 SP1. We don't use the GUI, so I > never build it, but I just tried a 'configure --enable-gui' and it worked > fine: > > -------- > [~/torque-2.5.8] % ./configure --enable-gui > . > . > . > configure: Starting Tcl/Tk configuration > checking for Tcl configuration... found /usr/lib64/tclConfig.sh > checking for existence of /usr/lib64/tclConfig.sh... loading > checking for Tcl public headers... /usr/include > checking for tclx configuration... none > checking for Tk configuration... found /usr/lib64/tkConfig.sh > checking for existence of /usr/lib64/tkConfig.sh... loading > checking for Tk public headers... /usr/include > checking for Tcl_Init... yes > checking for Tk_Init... yes > checking whether to include the GUI-clients... yes > checking for tclsh... /usr/bin/tclsh8.5 > checking for wish... /usr/bin/wish8.5 > checking checking for Tcl attribute seperator... . 
> checking whether to enable tcl-qstat... yes > configure: Finished Tcl/Tk configuration > . > . > . > Building components: server=yes mom=yes clients=yes > gui=yes drmaa=no pam=no > PBS Machine type: linux > Remote copy: /usr/bin/scp -rpB > PBS home: /var/spool/torque > Default server: serverhost > Unix Domain sockets: no > Tcl: -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm > Tk: -L/usr/lib64 -ltk8.5 -L/usr/lib64 -lX11 -lXss -lXext -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm > > Ready for 'make'. > -------- > > I do notice that your packages are a bit out of date. These are the latest: > > % rpm -qa | egrep '^t(cl|k)-' > tcl-8.5.5-2.81 > tk-8.5.5-3.14.1 > tcl-devel-8.5.5-2.81 > tk-devel-8.5.5-3.14.1 > > Perhaps you are hitting a bug in the packages you have. Otherwise, I'd > posit that something has gone awry on your system in in your environment. > > HTH, > > Gabe Hi Albino Sorry, it didn't help. I don't have any SLES machine to try. On CentOS 5.2 and 5.4 and on various Fedora releases I have no problem whatsoever with the Torque gui. It installs and works, as long as the tcl and tk packages and devel packages are installed. Gus Correa From bino at coc.ufrj.br Thu Sep 8 10:17:46 2011 From: bino at coc.ufrj.br (Albino A. Aveleda) Date: Thu, 8 Sep 2011 13:17:46 -0300 (BRT) Subject: [torqueusers] enable-gui problem on SLES11-SP1 In-Reply-To: <573533145.5940.1315498554954.JavaMail.root@mailhost.coc.ufrj.br> Message-ID: <547459510.5942.1315498666092.JavaMail.root@mailhost.coc.ufrj.br> Hi Gabe, I updated the tk packages. # rpm -qa | egrep '^t(cl|k)-' tk-8.5.5-3.14.1 tk-devel-8.5.5-3.14.1 tcl-devel-8.5.5-2.81 tcl-8.5.5-2.81 And now I have the same version that you. I tried to compile again. # ./configure --enable-gui [... skip ...] configure: Starting Tcl/Tk configuration checking for Tcl configuration... found /usr/lib64/tclConfig.sh checking for existence of /usr/lib64/tclConfig.sh... loading checking for Tcl public headers... /usr/include checking for tclx configuration... 
none checking for Tk configuration... found /usr/lib64/tkConfig.sh checking for existence of /usr/lib64/tkConfig.sh... loading checking for Tk public headers... /usr/include checking for Tcl_Init... yes checking for Tk_Init... no configure: Your Tk install is broken. Disabling Tk support. checking whether to include the GUI-clients... configure: error: cannot build GUI without Tk library But I received the same error. Do you have some sugestion? Best regards, Bibo ----- Mensagem original ----- De: "Gabe Turner" Para: torqueusers at supercluster.org Enviadas: Quinta-feira, 8 de Setembro de 2011 11:32:45 Assunto: Re: [torqueusers] enable-gui problem on SLES11-SP1 On Thu, Sep 08, 2011 at 08:17:55AM -0300, Albino A. Aveleda wrote: > Hi Gus, > > Yes, my machine is 64bit. When I received this error I had only 64-bit > packages. Then I installed the 32-bit too, but I had the same error. > All configuration in tkConfig.sh file are 64-bit. > > I don?t know what happen. I don?t have this problem on RHEL and CentOS > linux. > I don't seem to have this problem on SLE11 SP1. We don't use the GUI, so I never build it, but I just tried a 'configure --enable-gui' and it worked fine: -------- [~/torque-2.5.8] % ./configure --enable-gui . . . configure: Starting Tcl/Tk configuration checking for Tcl configuration... found /usr/lib64/tclConfig.sh checking for existence of /usr/lib64/tclConfig.sh... loading checking for Tcl public headers... /usr/include checking for tclx configuration... none checking for Tk configuration... found /usr/lib64/tkConfig.sh checking for existence of /usr/lib64/tkConfig.sh... loading checking for Tk public headers... /usr/include checking for Tcl_Init... yes checking for Tk_Init... yes checking whether to include the GUI-clients... yes checking for tclsh... /usr/bin/tclsh8.5 checking for wish... /usr/bin/wish8.5 checking checking for Tcl attribute seperator... . checking whether to enable tcl-qstat... yes configure: Finished Tcl/Tk configuration . . . 
Building components: server=yes mom=yes clients=yes gui=yes drmaa=no pam=no PBS Machine type: linux Remote copy: /usr/bin/scp -rpB PBS home: /var/spool/torque Default server: serverhost Unix Domain sockets: no Tcl: -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm Tk: -L/usr/lib64 -ltk8.5 -L/usr/lib64 -lX11 -lXss -lXext -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm Ready for 'make'. -------- I do notice that your packages are a bit out of date. These are the latest: % rpm -qa | egrep '^t(cl|k)-' tcl-8.5.5-2.81 tk-8.5.5-3.14.1 tcl-devel-8.5.5-2.81 tk-devel-8.5.5-3.14.1 Perhaps you are hitting a bug in the packages you have. Otherwise, I'd posit that something has gone awry on your system in in your environment. HTH, Gabe -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From gabe at msi.umn.edu Thu Sep 8 10:25:34 2011 From: gabe at msi.umn.edu (Gabe Turner) Date: Thu, 8 Sep 2011 11:25:34 -0500 Subject: [torqueusers] enable-gui problem on SLES11-SP1 In-Reply-To: <547459510.5942.1315498666092.JavaMail.root@mailhost.coc.ufrj.br> References: <573533145.5940.1315498554954.JavaMail.root@mailhost.coc.ufrj.br> <547459510.5942.1315498666092.JavaMail.root@mailhost.coc.ufrj.br> Message-ID: <20110908162534.GH21637@blackice.msi.umn.edu> On Thu, Sep 08, 2011 at 01:17:46PM -0300, Albino A. Aveleda wrote: > I updated the tk packages. > # rpm -qa | egrep '^t(cl|k)-' > tk-8.5.5-3.14.1 > tk-devel-8.5.5-3.14.1 > tcl-devel-8.5.5-2.81 > tcl-8.5.5-2.81 > > And now I have the same version that you. I tried > to compile again. > > # ./configure --enable-gui > [... skip ...] > checking for Tk_Init... no > configure: Your Tk install is broken. Disabling Tk support. > checking whether to include the GUI-clients... 
configure: error: cannot build GUI without Tk library
>
> But I received the same error.
>
> Do you have some sugestion?

Open up config.log and search for 'Tk_Init'. Hopefully there is an error in the lines following "checking for Tk_Init". This is what it looks like if it's working properly:

configure:33373: checking for Tk_Init
configure:33430: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE conftest.c -L/usr/lib64 -ltk8.5 -L/usr/lib64 -lX11 -lXss -lXext -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm >&5
configure:33436: $? = 0

It's linking in a number of X libraries, too. Perhaps you are missing one or more of them.

-- 
Gabe Turner gabe at msi.umn.edu
HPC Systems Administrator,
University of Minnesota Supercomputing Institute
http://www.msi.umn.edu

From m.meinke at aia.rwth-aachen.de Fri Sep 9 05:45:26 2011
From: m.meinke at aia.rwth-aachen.de (Matthias Meinke)
Date: Fri, 09 Sep 2011 13:45:26 +0200
Subject: [torqueusers] Maui ignores resource -l proc
Message-ID: <201109091345.26255.m.meinke@aia.rwth-aachen.de>

Hi,

I am pretty new to torque and maui, so I hope this is not a dumb question:

We are running a cluster with around 60 nodes and 400 cores. I have set up torque-3.0.2 with maui-3.3.1 for a cluster with 3 queues. Some users would like to specify the number of cores with the -l procs= directive, since they do not care how many cores they get on the individual nodes.

When such a job with a specification of -l procs is submitted, I observe the following behaviour with maui, e.g.:

qsub -I -l procs=32

output from tracejob -v:

Resource_List.neednodes=2:ppn=8
Resource_List.nodect=2
Resource_List.nodes=2:ppn=8
Resource_List.procs=32

Allocated are 2 nodes with 8 cores each = 16 procs, where the value of Resource_List.neednodes=2:ppn=8 is taken from resources_default.nodes=2:ppn=8, which I have defined in the torque queue.
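
The gap between the 32 requested procs and the 16 allocated cores can be checked with a small sketch. It only does the arithmetic on a simple `N:ppn=M` nodes specification; the function name is made up for illustration, and it makes no claim about how maui or pbs_server actually parse the request.

```python
import re


def cores_in_nodes_spec(spec):
    """Total cores described by a simple nodes specification.

    Handles '+'-separated chunks of the form 'N' or 'N:ppn=M'.
    A chunk without ppn counts one core per node. Named hosts and
    node properties are deliberately not handled in this sketch.
    """
    total = 0
    for chunk in spec.split("+"):
        match = re.fullmatch(r"(\d+)(?::ppn=(\d+))?", chunk)
        if match is None:
            raise ValueError("unsupported chunk: %r" % chunk)
        nodes = int(match.group(1))
        ppn = int(match.group(2)) if match.group(2) else 1
        total += nodes * ppn
    return total


requested_procs = 32                          # from -l procs=32
allocated = cores_in_nodes_spec("2:ppn=8")    # from the queue default
print("allocated:", allocated, "missing:", requested_procs - allocated)
```

With the values from the tracejob output, the queue default of 2:ppn=8 accounts for exactly the 16 cores that were allocated, which is consistent with the default being used in place of the procs request.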
When I change that value for resources_default.nodes, maui just uses that value for the node and core allocation, so it seems that -l procs is completely ignored and maui instead uses the default resources defined for the queue.

If I use pbs_sched instead of maui, there is no such problem and the correct number of cores is allocated to the job when specifying -l procs=32.

The maui config file is just the standard file coming from the distribution with very few changes, such as:

SERVERHOST aia256
RMCFG[AIA256] TYPE=PBS
NODEALLOCATIONPOLICY MINRESOURCE
USERCFG[DEFAULT] MAXPROC=192

I am grateful for any help...

Thanks
Matthias

From bino at coc.ufrj.br Fri Sep 9 06:45:28 2011
From: bino at coc.ufrj.br (Albino A. Aveleda)
Date: Fri, 9 Sep 2011 09:45:28 -0300 (BRT)
Subject: [torqueusers] enable-gui problem on SLES11-SP1
In-Reply-To: <776520853.6902.1315572282202.JavaMail.root@mailhost.coc.ufrj.br>
Message-ID: <1389167482.6904.1315572328737.JavaMail.root@mailhost.coc.ufrj.br>

Hi Gabe,

The xorg-x11-devel package was missing. It's working now.
Thank you for your help.

Best regards,
Bibo

----- Original message -----
From: "Gabe Turner"
To: torqueusers at supercluster.org
Sent: Thursday, September 8, 2011 13:25:34
Subject: Re: [torqueusers] enable-gui problem on SLES11-SP1

On Thu, Sep 08, 2011 at 01:17:46PM -0300, Albino A. Aveleda wrote:
> I updated the tk packages.
> # rpm -qa | egrep '^t(cl|k)-'
> tk-8.5.5-3.14.1
> tk-devel-8.5.5-3.14.1
> tcl-devel-8.5.5-2.81
> tcl-8.5.5-2.81
>
> And now I have the same version that you. I tried
> to compile again.
>
> # ./configure --enable-gui
> [... skip ...]
> checking for Tk_Init... no
> configure: Your Tk install is broken. Disabling Tk support.
> checking whether to include the GUI-clients... configure: error: cannot build GUI without Tk library
>
> But I received the same error.
>
> Do you have some sugestion?

Open up config.log and search for 'Tk_Init'.
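
The config.log search suggested here can be sketched in a few lines of Python. This is only an illustration: the helper name and the default of ten context lines are assumptions, and the sample text is the working excerpt quoted in this thread, not a failing log.

```python
def lines_after_check(log_text, symbol="Tk_Init", context=10):
    """Return up to `context` lines of config.log text, starting at the
    first line that contains 'checking for <symbol>'."""
    lines = log_text.splitlines()
    needle = "checking for " + symbol
    for index, line in enumerate(lines):
        if needle in line:
            return lines[index:index + context]
    return []  # the check was not found in this log


# Sample taken from the working config.log excerpt quoted in this thread.
sample = """\
configure:33373: checking for Tk_Init
configure:33430: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE conftest.c -L/usr/lib64 -ltk8.5 -L/usr/lib64 -lX11 -lXss -lXext -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm >&5
configure:33436: $? = 0
"""

for line in lines_after_check(sample):
    print(line)
```

On a real build tree one would pass the contents of config.log instead of the sample string. In a failing run, the lines after the gcc command would be expected to carry the linker error, for example a library that could not be found.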
Hopefully there is an error in the lines following "checking for TK_Init". This is what it looks like if it's working properly: configure:33373: checking for Tk_Init configure:33430: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE conftest.c -L/usr/lib64 -ltk8.5 -L/usr/lib64 -lX11 -lXss -lXext -L/usr/lib64 -ltcl8.5 -ldl -lieee -lm >&5 configure:33436: $? = 0 It's linking in a number of X libraries, too. perhaps you are missing one or more of them. -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu From scrusan at ur.rochester.edu Fri Sep 9 10:03:44 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Fri, 9 Sep 2011 12:03:44 -0400 Subject: [torqueusers] NVIDIA GPUs version error In-Reply-To: <7aaf3c88-7e62-4953-9a07-3c73f6edc301@mail> References: <7aaf3c88-7e62-4953-9a07-3c73f6edc301@mail> Message-ID: <4BE4C875-97D3-4580-A893-AE6F2D8D773A@ur.rochester.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Al, Thanks for the response. I found what you are talking about in the src/resmom/mom_server.c Our GPUs are working fine, so I'll wait for the newest release, and then move from there. ~Steve On Sep 7, 2011, at 2:48 PM, Al Taufer wrote: > We have only tested Torque using the 260 and 270 Nvidia Drivers so driver versions greater than 270 are not yet recognized. I am in the process of testing with the 275 and 280 drivers and hope to update Torque this week so it will accept any driver version greater than 260. There were major changes between the 260 and 270 driver versions and we should be okay with future driver releases as long as the Nvidia interface does not change. 
> > Al > > ----- Original Message ----- >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Hi all, >> >> I'm getting errors in my syslog from our gpu nodes pbs_moms: >> >> Aug 22 15:55:09 blugpu07 pbs_mom: LOG_ERROR::a system error occured >> (15205) in generate_server_gpustatus_smi, Unknown Nvidia driver >> version >> >> Here is the snipped output of pbsnodes blugpu07: >> >> gpu_status = >> gpu[1]=gpu_id=0:15:0;,gpu[0]=gpu_id=0:14:0;,driver_ver=275.09.07,timestamp=Mon >> Aug 22 15:56:41 2011 >> >> >> If I login to the node, and check the pbs_mom logfiles, I see the >> following: >> >> 08/22/2011 15:57:24;0002; >> pbs_mom;n/a;mom_server_all_update_gpustat;composing gpu status >> update for server >> 08/22/2011 15:57:24;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::gpus, gpus: >> GPU cmd issued: nvidia-smi -a -x 2>&1 >> 08/22/2011 15:57:26;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::a system >> error occured (15205) in generate_server_gpustatus_smi, Unknown >> Nvidia driver versio n >> 08/22/2011 15:57:26;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::a system >> error occured (15205) in generate_server_gpustatus_smi, Unknown >> Nvidia driver versio n >> 08/22/2011 15:57:26;0002; >> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: >> sending to server "timestamp=Mon Aug 22 15:57:26 2011" >> 08/22/2011 15:57:26;0002; >> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: >> sending to server "driver_ver=275.09.07" >> 08/22/2011 15:57:26;0002; >> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: >> sending to server "gpuid=0:14:0" >> 08/22/2011 15:57:26;0002; >> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: >> sending to server "gpuid=0:15:0" >> 08/22/2011 15:57:26;0002; >> pbs_mom;n/a;mom_server_update_gpustat;status update successfully >> sent to bhsn-int >> >> >> Is this driver version we have not supported by torque? 
>> >> >> >> Environment: >> - TORQUE-2.5.6 >> - NVIDIA Driver Version : 275.09.07 >> - kernel: 2.6.18-238.12.1.el5 >> >> - TORQUE client was build via: >> This build was configured with: '''--prefix=/opt/torque/2.5.6' >> '--exec-prefix=/opt/torque/2.5.6/x86_64' >> '--with-server-home=/var/spool/pbs' '--enable-syslog' '--with-scp' >> '--disable-rpp' '--disable-spool' '--with-pam' '--with-cpusets' >> '--with-geometry-requests' '--disable-gui' '--enable-nvidia-gpus' >> '--enable-docs' >> >> >> >> ---------------------- >> Steve Crusan >> System Administrator >> Center for Research Computing >> University of Rochester >> https://www.crc.rochester.edu/ >> >> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG/MacGPG2 v2.0.17 (Darwin) >> Comment: GPGTools - http://gpgtools.org >> >> iQEcBAEBAgAGBQJOUrawAAoJENS19LGOpgqKwkoIAIQrY8rZn+J+vaSgnTElGxvu >> KcMYlqkiBBZtix7YBCVMsHTv5PcOPT/4l1qHX4/7/P9ZW6Xc542LNKLJrd46FcLa >> cmbkixUaGRJ5SDCVSyA6YzZZIBDHBjP3AMrIouDwjyOEhR3A9agI5yYPdFTRdcNQ >> NoagT372lZnhVfPUYrVLM8oVIbS+KsZZGiYA4HShsbPUB/qqU/YqNroLlg7o8lVX >> gHBY7C231TpC/YAJx1xZ5qjSSl1/mtzK8PuzqZ5mWBFtoXFvlzXFe+C0uqcCHLh2 >> jjkGeRU09YCkHEuqJy+iQ/KDGgvAFSmyuDgWq3RPJX8c7xw+y7saDLjhH9vPdVg= >> =zdfO >> -----END PGP SIGNATURE----- >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOajjlAAoJENS19LGOpgqKEbsH/2v4C6yglvkpVgvDOPTnr9Ud DbygLsflOBypKnD/tJ7yenK65eBhq2P5cDr4qMtOyAya+gs7g2NlTVz4x7skgmFF 
mGTy0FKgsUf9rk9LcQZIfeFIle+l5T9TpHGN0+fvF0zhrO7hTreQrtzCw1tm7WsB 0KzpF702c1RL8gLX7OVxE9t5NG9i1eadTEVlRpK6/5eNVVGQYP3P6nEmuDieUy2r L3MYRLG6/v5AbBED1QztcMuuvUs4zBPwh0k4ItmpISwHDht6/YXOquRD2Yr17Q9S S3qNfNBV6AlKlqaeDfQHYh6RGpfolRtmixdEFRxu4sbZ8fndoB4kBJ78qXVd744= =mhIq -----END PGP SIGNATURE----- From j.blank at fz-juelich.de Fri Sep 9 10:57:56 2011 From: j.blank at fz-juelich.de (Joerg Blank) Date: Fri, 9 Sep 2011 16:57:56 +0000 (UTC) Subject: [torqueusers] DRMAA problems Message-ID: Hello, I recently converted our cluster from SGE to Torque. We have some scripts which use DRMAA to launch and wait for jobs and I'm currently trying to adapt those. Unfortunately I encountered some problems with the Torque drmaa library. Some parts of the job options are not accepted by Torque. Especially the whole implicite working directory settings are missing (automatically set output path to cwd). It also does not seem to accept joinFiles and nativeSpecification. I usually use DRMAA via python-drmaa but the tests included in torque also fail. I'll include a example "qstats -f" output where I set explicitly output and error path via drmaa. I also tried the pbs-drmaa library, which shows the same behaviour. We are using Torque 2.5.8. 
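
Since the implicit working-directory defaults are what is missing, one way to sidestep them is to spell out every path on the job template. The following is a minimal sketch using python-drmaa; the helper names and the `.out`/`.err` file names are made up for illustration, and whether the Torque DRMAA library honours each attribute is exactly what is in question here.

```python
import os


def explicit_paths(job_name, workdir=None):
    """Build job template attributes with every path spelled out,
    instead of relying on defaults derived from the working directory."""
    workdir = os.path.abspath(workdir or os.getcwd())
    return {
        "jobName": job_name,
        "workingDirectory": workdir,
        # DRMAA paths have the form "[hostname]:path"; with no host
        # given, the value starts with a colon.
        "outputPath": ":" + os.path.join(workdir, job_name + ".out"),
        "errorPath": ":" + os.path.join(workdir, job_name + ".err"),
    }


def submit_and_wait(command, args, attrs):
    """Submit one job through python-drmaa and wait for it to finish.
    Needs a working DRMAA library; not exercised by the helper above."""
    import drmaa  # imported here so explicit_paths() works without it

    session = drmaa.Session()
    session.initialize()
    try:
        template = session.createJobTemplate()
        template.remoteCommand = command
        template.args = args
        for name, value in attrs.items():
            setattr(template, name, value)
        job_id = session.runJob(template)
        info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        session.deleteJobTemplate(template)
        return info
    finally:
        session.exit()


print(explicit_paths("sleeper", "/home/user/mpidebug"))
```

A call such as `submit_and_wait("/bin/sleep", ["60"], explicit_paths("sleeper"))` would then submit with explicit paths; the command and argument are placeholders.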
Regards,
Joerg Blank

Job Id: 175.cluster
    Job_Name = sleeper
    Job_Owner = user at cluster
    job_state = Q
    queue = batch
    server = cluster
    Checkpoint = u
    ctime = Fri Sep 9 18:42:55 2011
    Error_Path = /home/user/mpidebug/sleeper.e175
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    mtime = Fri Sep 9 18:42:55 2011
    Output_Path = /home/user/mpidebug/sleeper.o175
    Priority = 0
    qtime = Fri Sep 9 18:42:55 2011
    Rerunable = True
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.pmem = 2gb
    Resource_List.walltime = 01:00:00
    Shell_Path_List = /bin/sh
    substate = 10
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=cluster
    euser = user
    egroup = users
    queue_rank = 1345
    queue_type = E
    etime = Fri Sep 9 18:42:55 2011

From knielson at adaptivecomputing.com Fri Sep 9 14:11:47 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 09 Sep 2011 14:11:47 -0600 (MDT)
Subject: [torqueusers] File staging syntax
In-Reply-To: <9596036d-ff12-43a6-a8ce-ec19ca705051@zimbra.scai.fraunhofer.de>
Message-ID: <049df4e4-f6e1-4be6-bd67-b57df93df0e3@mail>

----- Original Message -----
> From: "André Gemünd"
> To: torqueusers at supercluster.org
> Sent: Thursday, September 8, 2011 1:59:22 AM
> Subject: [torqueusers] File staging syntax
>
> Hello everyone,
>
> is it possible that the -W syntax changed again between 2.5.5 and
> 2.5.8? We were using 2.5.5 without problems, but since I upgraded to
> 2.5.8 yesterday, PBS scripts with more than one file per staging
> line failed with "illegal -W syntax". I had to change the scripts to
> use separate -W lines for every file. I didn't see this in the
> changelog, or maybe I just missed it?
>
> Greetings
> André
>
>
> --
> André Gemünd
> Fraunhofer-Institute for Algorithms and Scientific Computing
> andre.gemuend at scai.fraunhofer.de
> Tel: +49 2241 14-2193
> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

Andre,

Can you send your qsub or msub line? Can you send your script as well?
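
For reference, the two forms being contrasted in the report can be written out as follows. This is only a sketch: the file names and the host `fileserver` are hypothetical, the `local_file@hostname:remote_file` shape follows the qsub man page, and whether 2.5.8 rejects the combined form is the open question in this thread, not something the code settles.

```python
def stagein_combined(files, host):
    """One directive that lists every file, comma-separated."""
    file_list = ",".join(
        "%s@%s:%s" % (local, host, remote) for local, remote in files
    )
    return "#PBS -W stagein=" + file_list


def stagein_separate(files, host):
    """One directive per file, as in the workaround described above."""
    return [
        "#PBS -W stagein=%s@%s:%s" % (local, host, remote)
        for local, remote in files
    ]


# Hypothetical (local, remote) pairs used only for illustration.
files = [("input.dat", "/data/input.dat"), ("params.cfg", "/data/params.cfg")]

print(stagein_combined(files, "fileserver"))
for directive in stagein_separate(files, "fileserver"):
    print(directive)
```

Generating both variants from the same file list makes it easy to submit one test job of each kind and compare the behaviour across versions.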
Regards Ken Nielson Adaptive Computing From knielson at adaptivecomputing.com Fri Sep 9 14:53:03 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 09 Sep 2011 14:53:03 -0600 (MDT) Subject: [torqueusers] torque :: configuration problems on localhost In-Reply-To: <4E6684C9.8040500@ldeo.columbia.edu> Message-ID: Adrian, You also need to make sure you have a queue created. It will need to be an execution queue with the started and enabled options set to true. Did you run torque.setup after you installed? Ken ----- Original Message ----- > From: "Gus Correa" > To: "Torque Users Mailing List" > Sent: Tuesday, September 6, 2011 2:38:33 PM > Subject: Re: [torqueusers] torque :: configuration problems on localhost > > PS - You may also need a ${TORQUE}/server_priv/nodes file, > if not yet there: > http://www.adaptivecomputing.com/resources/docs/torque/1.5nodeconfig.php > Maybe also a ${TORQUE}/mom_priv/config file with a line like this: > $pbsserver 'your pbs_server name' > > > Gus Correa wrote: > > Hi Adrian > > > > For what it is worth, here we have the server FQDN name > > in ${TORQUE}/sever_name. The acl and other Torque > > configuration parameters point to that FQDN (not to localhost, > > which is the loopback interface name in /etc/hosts, right?) > > > > I hope this helps. > > Gus Correa > > > > Adrian Sevcenco wrote: > >> Hi! I try to install torque on a simple 8 core computer .. > >> i have a simple queue and simple configuration .. > >> the thing is that i keep receiving: > >> > >> set server managers = root at localhost > >> qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad > >> host: > >> localhost > >> > >> Any idea about the problem? > >> Thanks! 
> >> > >> > >> OS : Fedora 14, torque 2.4.11 > >> > >> configuration: > >> > >> create queue alice > >> set queue alice queue_type = Execution > >> set queue alice max_running = 8 > >> set queue alice max_queuable = 2000 > >> set queue alice resources_max.cput = 24:00:00 > >> set queue alice resources_max.walltime = 48:00:00 > >> set queue alice enabled = True > >> set queue alice started = True > >> > >> # > >> # Set server attributes. > >> # > >> > >> ## DEFAULT QUEUE > >> set server default_queue = alice > >> > >> # seting submittins hosts > >> set server scheduling = True > >> # set server acl_host_enable = True > >> > >> set server submit_hosts = localhost > >> set server acl_hosts = localhost > >> > >> set server managers = root at localhost > >> set server managers += asevcenc at localhost > >> > >> set server operators = root at localhost > >> set server operators += asevcenc at localhost > >> > >> set server allow_node_submit = True > >> > >> set server log_events = 511 > >> set server mail_from = torque at sev2.spacescience.ro > >> set server query_other_jobs = True > >> set server resources_default.walltime = 72:00:00 > >> set server scheduler_iteration = 600 > >> set server node_ping_rate = 60 > >> set server node_check_rate = 150 > >> set server tcp_timeout = 20 > >> set server node_pack = True > >> set server poll_jobs = True > >> > >> > >> > >> > >> > >> ------------------------------------------------------------------------ > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > 
http://www.supercluster.org/mailman/listinfo/torqueusers > From blake.wickliffe at aramco.com Fri Sep 9 22:42:20 2011 From: blake.wickliffe at aramco.com (Wickliffe, Blake W) Date: Sat, 10 Sep 2011 04:42:20 +0000 Subject: [torqueusers] Strange problem in Torque 2.5.7+ Message-ID: <09D3C16B37878C44837F749DB16ACF19054BAE@DHA00730-MSXP03.aramco.com> Hello, We've been bitten by a strange problem twice now in Torque, so I thought I'd check to see if anyone else has run into it. We are running Torque 2.5.7 on a large-ish cluster (3000+ nodes) and the pbs_server daemon hangs. All qstat or pbsnodes commands fail. The process is still in memory but it drops to 0% CPU utilization. Restarting the pbs_server allows it to come back up for a few seconds but then it hangs again. If I clear out all the jobs in the "jobs" directory and restart the server it comes back up fine. The last time this happened, I was able to move jobs back into the directory a few at a time and keep restarting the pbs_server until I isolated the few jobs that were causing the server to hang. Checking the files, all of these jobs were running on two nodes that had crashed. So, in essence, a pbs_mom node crashed and took down the entire cluster with it. As I said, we've seen this happen twice now. Has anyone else seen this? Regards, Blake Wickliffe Saudi Aramco ENOD/CSYS/USG HPC Team (873-4417) ________________________________ The contents of this email, including all related responses, files and attachments transmitted with it (collectively referred to as "this Email"), are intended solely for the use of the individual/entity to whom/which they are addressed, and may contain confidential and/or legally privileged information. This Email may not be disclosed or forwarded to anyone else without authorization from the originator of this Email. If you have received this Email in error, please notify the sender immediately and delete all copies from your system. 
Please note that the views or opinions presented in this Email are those of the author and may not necessarily represent those of Saudi Aramco. The recipient should check this Email and any attachments for the presence of any viruses. Saudi Aramco accepts no liability for any damage caused by any virus/error transmitted by this Email.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110910/e09784d1/attachment.html

From nt_mahmood at yahoo.com Sat Sep 10 23:46:29 2011
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Sat, 10 Sep 2011 22:46:29 -0700 (PDT)
Subject: [torqueusers] Job is in 'Q' but checkjob shows it is running (!)
Message-ID: <1315719989.32315.YahooMailNeo@web111717.mail.gq1.yahoo.com>

Hi,

Can someone explain why the qstat shows a job in "Q" but checkjob says everything is normal?

mahmood at srv1:416.gamess$ qstat 49003
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
49003.srv1                gamess           mahmood                0 Q Long

mahmood at srv1:416.gamess$ checkjob 49003
checking job 49003

State: Idle
Creds:  user:mahmood  group:mahmood  class:Long  qos:DEFAULT
WallTime: 00:00:00 of 40:00:00:00
SubmitTime: Sun Sep 11 09:51:26
  (Time Queued  Total: 00:02:36  Eligible: 00:02:36)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       HOSTLIST RESTARTABLE
HostList:
  [hawk:1]

PE:  1.00  StartPriority:  129
job can run in partition DEFAULT (3 procs available.  1 procs required)

Thanks
// Naderan *Mahmood;

From mamonski at man.poznan.pl Sun Sep 11 18:15:27 2011
From: mamonski at man.poznan.pl (Mariusz Mamoński)
Date: Mon, 12 Sep 2011 02:15:27 +0200
Subject: [torqueusers] DRMAA problems
In-Reply-To:
References:
Message-ID:

On 9 September 2011 18:57, Joerg Blank wrote:
> Hello,
>
> I recently converted our cluster from SGE to Torque. We have some scripts which
> use DRMAA to launch and wait for jobs and I'm currently trying to adapt those.
>
> Unfortunately I encountered some problems with the Torque drmaa library. Some
> parts of the job options are not accepted by Torque. Especially the whole
> implicit working directory settings are missing (automatically set output path
> to cwd). It also does not seem to accept joinFiles and nativeSpecification.
>
> I usually use DRMAA via python-drmaa but the tests included in torque also fail.
>
> I'll include an example "qstat -f" output where I set explicitly output and
> error path via drmaa.
>
> I also tried the pbs-drmaa library, which shows the same behaviour. We are using
> Torque 2.5.8.

Could you try the newest 1.0.9 version from SourceForge:
https://sourceforge.net/projects/pbspro-drmaa/ ?
This version should fix this problem. The nativeSpecification should also work. See:
http://apps.man.poznan.pl/trac/pbs-drmaa/wiki/WikiStart#Nativespecification

>
> Regards,
> Joerg Blank
>
> Job Id: 175.cluster
>    Job_Name = sleeper
>    Job_Owner = user at cluster
>    job_state = Q
>    queue = batch
>    server = cluster
>    Checkpoint = u
>    ctime = Fri Sep  9 18:42:55 2011
>    Error_Path = /home/user/mpidebug/sleeper.e175
>    Hold_Types = n
>    Join_Path = n
>    Keep_Files = n
>    mtime = Fri Sep  9 18:42:55 2011
>    Output_Path = /home/user/mpidebug/sleeper.o175
>    Priority = 0
>    qtime = Fri Sep  9 18:42:55 2011
>    Rerunable = True
>    Resource_List.neednodes = 1
>    Resource_List.nodect = 1
>    Resource_List.nodes = 1
>    Resource_List.pmem = 2gb
>    Resource_List.walltime = 01:00:00
>    Shell_Path_List = /bin/sh
>    substate = 10
>    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=cluster
>    euser = user
>    egroup = users
>    queue_rank = 1345
>    queue_type = E
>    etime = Fri Sep  9 18:42:55 2011
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

BR,
--
Mariusz

From j.blank at fz-juelich.de Mon Sep 12 05:20:58 2011
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Mon, 12 Sep 2011 13:20:58 +0200
Subject: [torqueusers] DRMAA problems
In-Reply-To:
References:
Message-ID:

On 12.09.2011 02:15, Mariusz Mamoński wrote:
> Could you try the newest 1.0.9 version from source forge:

Works now. Thanks!

Regards,
Joerg Blank

From charles.johnson at accre.vanderbilt.edu Mon Sep 12 06:39:01 2011
From: charles.johnson at accre.vanderbilt.edu (Charles Johnson)
Date: Mon, 12 Sep 2011 07:39:01 -0500
Subject: [torqueusers] Strange problem in Torque 2.5.7+
In-Reply-To: <09D3C16B37878C44837F749DB16ACF19054BAE@DHA00730-MSXP03.aramco.com>
References: <09D3C16B37878C44837F749DB16ACF19054BAE@DHA00730-MSXP03.aramco.com>
Message-ID:

On Sep 9, 2011, at 11:42 PM, Wickliffe, Blake W wrote:

> We've been bitten by a strange problem twice now in Torque, so I thought I'd check to see if anyone else has run into it. We are running Torque 2.5.7 on a large-ish cluster (3000+ nodes) and the pbs_server daemon hangs. All qstat or pbsnodes commands fail. The process is still in memory but it drops to 0% CPU utilization.
>
> Restarting the pbs_server allows it to come back up for a few seconds but then it hangs again. If I clear out all the jobs in the "jobs" directory and restart the server it comes back up fine. The last time this happened, I was able to move jobs back into the directory a few at a time and keep restarting the pbs_server until I isolated the few jobs that were causing the server to hang.
Checking the files, all of these jobs were running on two nodes that had crashed.

+1

Yes, we experience the same problem with the same symptoms. It takes me a while to track down the offending nodes. We have upgraded to 2.5.8 without relief. We use Moab 6.0.1 as well.

~Charles~

--
Charles Johnson, Vanderbilt University
Advanced Computing Center for Research & Education
Mailing Address: Peabody #34, 230 Appleton Place, Nashville, TN 37203
Shipping Address: 1231 18th Avenue South, Hill Center, Suite 143, Nashville, TN 37212
Office: 615-343-4134
Cell: 615-478-7788
Fax: 615-343-7216
charles.johnson at accre.vanderbilt.edu

From arnaubria at pic.es Mon Sep 12 08:13:00 2011
From: arnaubria at pic.es (Arnau Bria)
Date: Mon, 12 Sep 2011 16:13:00 +0200
Subject: [torqueusers] Strange problem in Torque 2.5.7+
In-Reply-To:
References: <09D3C16B37878C44837F749DB16ACF19054BAE@DHA00730-MSXP03.aramco.com>
Message-ID: <20110912161300.1de4fe50@amarrosa.pic.es>

Hi,

I had a similar issue with an older version (2.5.6). Could it be the tcp-retry-limit?

e - Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on
    pbs_server. We recommend --with-tcp-retry-limit=2 (backported from 3.0.1)

Cheers,
Arnau

From soubari at yahoo.com Mon Sep 12 08:45:24 2011
From: soubari at yahoo.com (sam oubari)
Date: Mon, 12 Sep 2011 07:45:24 -0700 (PDT)
Subject: [torqueusers] Help! One Puzzle At a Time... * update *
Message-ID: <1315838724.22667.YahooMailNeo@web110602.mail.gq1.yahoo.com>

Hello,

A quick update to my prior posts. I am using 2.5.6 on a local host with pbs_sched. To recap: I have a frequent job (one that resubmits itself every 10 minutes and runs the same static script). About twice a week it malfunctions by moving from W to Q without ever reaching the R state, and I then have to release it manually with qrun. I just noticed that a message like the one below gets logged in server_logs:
09/09/2011 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified?at request of rpt_prod at naboo.linnbenton.edu mailto:rpt_prod at naboo.linnbenton.edu In the logs, jobs normally get modified but usually by the scheduler not the job owner.? I did not see any related msgs in /var/log. ?Any hints for me? ? Thank you! Sam. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110912/326496db/attachment.html From JTRUTWIN at CSBSJU.EDU Mon Sep 12 14:19:01 2011 From: JTRUTWIN at CSBSJU.EDU (Trutwin, Joshua) Date: Mon, 12 Sep 2011 20:19:01 +0000 Subject: [torqueusers] Torque rpm build headaches Message-ID: <710C58696EA3BC42B425E4DBB39C1D5E25C911E9@MAIL-MBX2.ad.csbsju.edu> Hi - I just setup torque for a single compute node (currently). I'm having some trouble with the make rpm command passing all my configure params to the rpmbuild. Here's what I did: ./configure --prefix=/opt/torque-2.5.8 --disable-rpp --with-default-server=torque.csbsju.edu When I do this and run: make rpm In the output I see this: ./configure --host=x86_64-unknown-linux-gnu --build=x86_64-unknown-linux-gnu --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --includedir=/usr/include/torque --with-default-server=torque.csbsju.edu --with-server-home=/var/spool/torque --with-sendmail=/usr/sbin/sendmail --disable-dependency-tracking --disable-gui --without-tcl --with-rcp=scp --enable-syslog --disable-gcc-warnings --disable-munge-auth --without-pam --disable-drmaa --enable-high-availability --disable-qsub-keep-override --disable-blcr --disable-cpuset --enable-spool --enable-server-xml The --with-default-server was passed but not the --disable-rpp or 
--prefix=/opt/torque-2.5.8. And --disable-gui is passed in even though it's enabled when running the ./configure on it's own. I can get around these by running: make RPMOPTS="--with gui --without rpp" rpm I can't seem to change the prefix this way though, so what I do here is open torque.spec and add this line: %define _prefix /opt/torque-2.5.8 Which gets wiped out if I reconfigure. At this point if I run the make RPMOPTS... line above it gives me a configure that I can work with for my needs: ./configure --host=x86_64-unknown-linux-gnu --build=x86_64-unknown-linux-gnu --program-prefix= --prefix=/opt/torque-2.5.8 --exec-prefix=/opt/torque-2.5.8 --bindir=/opt/torque-2.5.8/bin --sbindir=/opt/torque-2.5.8/sbin --sysconfdir=/etc --datadir=/opt/torque-2.5.8/share --includedir=/opt/torque-2.5.8/include --libdir=/opt/torque-2.5.8/lib64 --libexecdir=/opt/torque-2.5.8/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/torque-2.5.8/share/man --infodir=/opt/torque-2.5.8/share/info --includedir=/opt/torque-2.5.8/include/torque --with-default-server=torque.csbsju.edu --with-server-home=/var/spool/torque --with-sendmail=/opt/torque-2.5.8/sbin/sendmail --disable-dependency-tracking --enable-gui --with-tcl --with-rcp=scp --enable-syslog --disable-gcc-warnings --disable-munge-auth --without-pam --disable-drmaa --enable-high-availability --disable-qsub-keep-override --disable-blcr --disable-cpuset --enable-spool --enable-server-xml Unfortunately with this though now it's picking a nonexistent sendmail so I added -with-sendmail=/usr/sbin/sendmail to configure the sendmail path but it still uses /opt/torque-2.5.8/sbin/sendmail so I have to add a symlink here after installing the rpms... Can this be simplified at all so it works like it says it should from the INSTALL doc please: TORQUE has built-in support for making RPMs. After running ./configure with all desired options, 'make rpm' should create a set of binary RPMs that match your configuration. 
I take "match your configuration" to mean that it'll keep all my original ./configure options. Thanks! Josh From Adrian.Sevcenco at cern.ch Mon Sep 12 15:00:49 2011 From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco) Date: Tue, 13 Sep 2011 00:00:49 +0300 Subject: [torqueusers] torque :: configuration problems on localhost In-Reply-To: References: Message-ID: <4E6E7301.30308@cern.ch> On 09/09/11 23:53, Ken Nielson wrote: > Adrian, > > You also need to make sure you have a queue created. It will need to > be an execution queue with the started and enabled options set to > true. > > Did you run torque.setup after you installed? no, i make the settings by hand .. anyway now is everything ok .. it seems that was a problem of acls.. i deleted all acls and let only : set server acl_hosts = localhost Now it it ok .. i can schedule jobs and jobs are running .. Thanks, Adrian -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3110 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20110913/2f8563fc/attachment.bin From knielson at adaptivecomputing.com Mon Sep 12 15:06:56 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 12 Sep 2011 15:06:56 -0600 (MDT) Subject: [torqueusers] Strange problem in Torque 2.5.7+ In-Reply-To: <20110912161300.1de4fe50@amarrosa.pic.es> Message-ID: <63e183f6-e89d-4600-bf5f-f4d5169e5da1@mail> ----- Original Message ----- > From: "Arnau Bria" > To: torqueusers at supercluster.org > Sent: Monday, September 12, 2011 8:13:00 AM > Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+ > > Hi, > > > I had similar issue with older version (2.5.6) Could it be the > tcp-retry-limit ? > > e - Add a configure option --with-tcp-retry-limit to prevent > potential 4+ hour hangs on > pbs_server. 
We recommend --with-tcp-retry-limit=2 (backported > from 3.0.1) > > Cheers, > Arnau Do any of you have TORQUE log files from when the server hung? Ken From charles.johnson at accre.vanderbilt.edu Mon Sep 12 15:31:44 2011 From: charles.johnson at accre.vanderbilt.edu (Charles Johnson) Date: Mon, 12 Sep 2011 16:31:44 -0500 Subject: [torqueusers] Strange problem in Torque 2.5.7+ In-Reply-To: <63e183f6-e89d-4600-bf5f-f4d5169e5da1@mail> References: <63e183f6-e89d-4600-bf5f-f4d5169e5da1@mail> Message-ID: <092F76AE-F67B-49D3-9BF7-4F07A9546E8D@accre.vanderbilt.edu> Next time it happens (usually once in a day) I can send you a portion of the log file. They are generally about 1GB in size (we have upped the logging trying to get information). Any idea what you might be looking for, or how much of the file you might need? ~Charles~ On Sep 12, 2011, at 4:06 PM, Ken Nielson wrote: > ----- Original Message ----- >> From: "Arnau Bria" >> To: torqueusers at supercluster.org >> Sent: Monday, September 12, 2011 8:13:00 AM >> Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+ >> >> Hi, >> >> >> I had similar issue with older version (2.5.6) Could it be the >> tcp-retry-limit ? >> >> e - Add a configure option --with-tcp-retry-limit to prevent >> potential 4+ hour hangs on >> pbs_server. We recommend --with-tcp-retry-limit=2 (backported >> from 3.0.1) >> >> Cheers, >> Arnau > > Do any of you have TORQUE log files from when the server hung? 
> > Ken > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From knielson at adaptivecomputing.com Mon Sep 12 15:37:10 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 12 Sep 2011 15:37:10 -0600 (MDT) Subject: [torqueusers] Strange problem in Torque 2.5.7+ In-Reply-To: <092F76AE-F67B-49D3-9BF7-4F07A9546E8D@accre.vanderbilt.edu> Message-ID: <2232a38b-0730-4156-8669-bd24796bb0c2@mail> ----- Original Message ----- > From: "Charles Johnson" > To: "Torque Users Mailing List" > Sent: Monday, September 12, 2011 3:31:44 PM > Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+ > > Next time it happens (usually once in a day) I can send you a portion > of the log file. They are generally about 1GB in size (we have upped > the logging trying to get information). Any idea what you might be > looking for, or how much of the file you might need? > > ~Charles~ The pertinent information will likely be in the last 5 or 10 minutes before the server hung. We can start with that. If we need more we can get that later. Thanks Ken From charles.johnson at accre.vanderbilt.edu Mon Sep 12 15:48:20 2011 From: charles.johnson at accre.vanderbilt.edu (Charles Johnson) Date: Mon, 12 Sep 2011 16:48:20 -0500 Subject: [torqueusers] Strange problem in Torque 2.5.7+ In-Reply-To: <2232a38b-0730-4156-8669-bd24796bb0c2@mail> References: <2232a38b-0730-4156-8669-bd24796bb0c2@mail> Message-ID: Next time it happen I will send it to you. It maybe later this evening, or early morning, but you will get it. I appreciate your looking at it. Thanks! 
~Charles~ On Sep 12, 2011, at 4:37 PM, Ken Nielson wrote: > ----- Original Message ----- >> From: "Charles Johnson" >> To: "Torque Users Mailing List" >> Sent: Monday, September 12, 2011 3:31:44 PM >> Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+ >> >> Next time it happens (usually once in a day) I can send you a portion >> of the log file. They are generally about 1GB in size (we have upped >> the logging trying to get information). Any idea what you might be >> looking for, or how much of the file you might need? >> >> ~Charles~ > > The pertinent information will likely be in the last 5 or 10 minutes before the server hung. We can start with that. If we need more we can get that later. > > Thanks > > Ken > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- Charles Johnson, Vanderbilt University Advanced Computing Center for Research & Education Mailing Address: Peabody #34, 230 Appleton Place, Nashville, TN 37203 From mej at lbl.gov Mon Sep 12 16:03:04 2011 From: mej at lbl.gov (Michael Jennings) Date: Mon, 12 Sep 2011 15:03:04 -0700 Subject: [torqueusers] Torque rpm build headaches In-Reply-To: <710C58696EA3BC42B425E4DBB39C1D5E25C911E9@MAIL-MBX2.ad.csbsju.edu> References: <710C58696EA3BC42B425E4DBB39C1D5E25C911E9@MAIL-MBX2.ad.csbsju.edu> Message-ID: <20110912220303.GQ10643@lbl.gov> On Monday, 12 September 2011, at 20:19:01 (+0000), Trutwin, Joshua wrote: > Hi - I just setup torque for a single compute node (currently). I'm > having some trouble with the make rpm command passing all my > configure params to the rpmbuild. Its goal is not to pass *all* your configure parameters to the rpmbuild process because doing so does not make sense. At present, it only passes those arguments for which the spec file has conditional build support or macro variables in it. 
This includes things like DRMAA, SCP, syslog, PAM, and (as you saw) the server name and path. RPP is not currently supported, but I hope to add support for more conditionals as time progresses. That said, since I don't use a lot of the conditionals torque supports, I can't test many of them. Thus, community contributions in the way of patches are most appreciated. :-) > The --with-default-server was passed but not the --disable-rpp or > --prefix=/opt/torque-2.5.8. And --disable-gui is passed in even > though it's enabled when running the ./configure on it's own. Unfortunately "gui" is not one of those yet supported by "make rpm" and is disabled by default in the spec, which explains the behavior you saw. You're right; this does need to be fixed. > I can get around these by running: > > make RPMOPTS="--with gui --without rpp" rpm > > I can't seem to change the prefix this way though, so what I do here is open torque.spec and add this line: > > %define _prefix /opt/torque-2.5.8 > > Which gets wiped out if I reconfigure. Did you try: make RPMOPTS="--with gui --without rpp --define '_prefix /opt/torque-2.5.8'" rpm > Unfortunately with this though now it's picking a nonexistent > sendmail so I added -with-sendmail=/usr/sbin/sendmail to configure > the sendmail path but it still uses /opt/torque-2.5.8/sbin/sendmail > so I have to add a symlink here after installing the rpms... This, too, can be specified in RPMOPTS as "--define 'sendmail_path /usr/sbin/sendmail'" > Can this be simplified at all so it works like it says it should > from the INSTALL doc please: > > TORQUE has built-in support for making RPMs. After running > ./configure with all desired options, 'make rpm' should create a > set of binary RPMs that match your configuration. > > I take "match your configuration" to mean that it'll keep all my > original ./configure options. Certain things are not currently supported because I am unable to test or haven't had time to support them all. 
Some are legitimate bugs. And some things aren't supported because they aren't standard practice and/or are generally unwise. Replicating the GNU configure prefix (default /usr/local) to the RPM prefix (default determined by build host RPM configuration, usually /usr) by default is one of those things that falls into the latter category and should only be done explicitly and with sufficient care and forethought. :-) It should NEVER keep "all original ./configure options" because this would create unmaintainable messes (as has been seen previously). But hopefully over time the RPMOPTS mechanism can become more featureful to allow greater site-specific customizations of built packages without the previous problems with information leakage. HTH, Michael -- Michael Jennings Linux Systems and Cluster Engineer High-Performance Computing Services Bldg 50B-3209E W: 510-495-2687 MS 050C-3396 F: 510-486-8615 From glen.beane at gmail.com Mon Sep 12 16:48:10 2011 From: glen.beane at gmail.com (Glen Beane) Date: Mon, 12 Sep 2011 18:48:10 -0400 Subject: [torqueusers] Torque rpm build headaches In-Reply-To: <710C58696EA3BC42B425E4DBB39C1D5E25C911E9@MAIL-MBX2.ad.csbsju.edu> References: <710C58696EA3BC42B425E4DBB39C1D5E25C911E9@MAIL-MBX2.ad.csbsju.edu> Message-ID: On Mon, Sep 12, 2011 at 4:19 PM, Trutwin, Joshua wrote: > > Hi - I just setup torque for a single compute node (currently). ?I'm having some trouble with the make rpm command passing all my configure params to the rpmbuild. > > Here's what I did: > > ./configure --prefix=/opt/torque-2.5.8 --disable-rpp --with-default-server=torque.csbsju.edu --disable-rpp doesn't really do much of anything (should only affect momctl in 2.5.8 -- possibly negatively), and really isn't advised anymore.? Here is the default "Garrick Staples" response circa 2006 Garrick Staples wrote many times: >Not in my opinion. The --disable-rpp option only effects Resource >Monitor requests (as when using momctl). 
Back in the OpenPBS days this >was more important because schedulers like maui had to issue RM requests >to every node. These days pbs_server is already providing that >information directly to schedulers. > >Without RPP, momctl must allocate a new socket and bind to a priviledged >port to talk to a MOM. When you talk to many MOMs quickly, you can >easily run out of priviledged ports to TIME_WAIT. > >With RPP, momctl works with one socket on one priviledged port. now I believe that the Adaptive Computing developers are working on removing the last vestiges of RPP from TORQUE. From toth at fi.muni.cz Tue Sep 13 03:05:33 2011 From: toth at fi.muni.cz (=?windows-1252?Q?=22Mgr=2E_=8Aimon_T=F3th=22?=) Date: Tue, 13 Sep 2011 11:05:33 +0200 Subject: [torqueusers] Strange problem in Torque 2.5.7+ In-Reply-To: <09D3C16B37878C44837F749DB16ACF19054BAE@DHA00730-MSXP03.aramco.com> References: <09D3C16B37878C44837F749DB16ACF19054BAE@DHA00730-MSXP03.aramco.com> Message-ID: <4E6F1CDD.5020303@fi.muni.cz> > We?ve been bitten by a strange problem twice now in Torque, so I thought > I?d check to see if anyone else has run into it. We are running Torque > 2.5.7 on a large-ish cluster (3000+ nodes) and the pbs_server daemon > hangs. All qstat or pbsnodes commands fail. The process is still in > memory but it drops to 0% CPU utilization. > > > > Restarting the pbs_server allows it to come back up for a few seconds > but then it hangs again. If I clear out all the jobs in the ?jobs? > directory and restart the server it comes back up fine. The last time > this happened, I was able to move jobs back into the directory a few at > a time and keep restarting the pbs_server until I isolated the few jobs > that were causing the server to hang. Checking the files, all of these > jobs were running on two nodes that had crashed. > > > > So, in essence, a pbs_mom node crashed and took down the entire cluster > with it. As I said, we?ve seen this happen twice now. Has anyone else > seen this? 
The issue is that the dis_tcp_wflush() function can hang for a very long time. The server waits until all data has been sent, which can take hours if the other side is slow enough. It will also hang until the timeouts occur when the other side is dead.

The patch I included is what we have done with the tcp_dis.c file. Sorry that the patch isn't clean, but unfortunately we have a lot of fixes and I don't really have the time to dig one specific fix out.

Specifically, ignore the GSSAPI (Kerberos) stuff and concentrate on the dis_tcp_wflush() function. The alarms are the important part. We are using 60-second timeouts; you should tailor that to your cluster. Most of what gets sent here is qstat replies, which for 8000 jobs should be somewhere around 100MB of data.

--
Mgr. Simon Toth
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tcp.diff
Type: text/x-patch
Size: 5271 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20110913/dfc1ae0f/attachment-0001.bin

From arnaubria at pic.es Tue Sep 13 03:31:59 2011
From: arnaubria at pic.es (Arnau Bria)
Date: Tue, 13 Sep 2011 11:31:59 +0200
Subject: [torqueusers] building torque+munge rpm
Message-ID: <20110913113159.442a7d04@amarrosa.pic.es>

Hi all,

I'm building torque 2.5.7 with munge.
I use the command: rpmbuild -ta --define 'prefix /usr' --define 'torque_home /var/spool/pbs' --define 'acflags --enable-munge-auth --enable-maxdefault --with-readline --with-tcp-retry-limit=2 --with-rcp=scp --with-default-server=pbs03.pic.es ' torque-2.5.7.tar.gz but, when I see the configure that it does I see a strange --disable-munge-auth on it: + ./configure --host=x86_64-redhat-linux-gnu --build=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --includedir=/usr/include/torque --with-default-server=localhost --with-server-home=/var/spool/pbs --with-sendmail=/usr/sbin/sendmail --disable-dependency-tracking --disable-gui --without-tcl --with-rcp=scp --enable-syslog --disable-gcc-warnings --disable-munge-auth --without-pam --disable-drmaa --enable-high-availability --disable-qsub-keep-override --disable-blcr --disable-cpuset --enable-spool --enable-server-xml --enable-munge-auth --enable-maxdefault --with-readline --with-tcp-retry-limit=2 --with-rcp=scp --with-default-server=pbs03.pic.es I think this is not ok cause later, torque works without munge started... Is there any problem with that version and munge? or with my command? TIA, Arnau From arnaubria at pic.es Tue Sep 13 03:50:58 2011 From: arnaubria at pic.es (Arnau Bria) Date: Tue, 13 Sep 2011 11:50:58 +0200 Subject: [torqueusers] building torque+munge rpm In-Reply-To: <20110913113159.442a7d04@amarrosa.pic.es> References: <20110913113159.442a7d04@amarrosa.pic.es> Message-ID: <20110913115058.4fcd127b@amarrosa.pic.es> Ok... munge-devel is needed.... maybe a warning needed??? 
Cheers, Arnau From arnaubria at pic.es Tue Sep 13 04:13:14 2011 From: arnaubria at pic.es (Arnau Bria) Date: Tue, 13 Sep 2011 12:13:14 +0200 Subject: [torqueusers] building torque+munge rpm In-Reply-To: <20110913115058.4fcd127b@amarrosa.pic.es> References: <20110913113159.442a7d04@amarrosa.pic.es> <20110913115058.4fcd127b@amarrosa.pic.es> Message-ID: <20110913121314.5ddb22e7@amarrosa.pic.es> On Tue, 13 Sep 2011 11:50:58 +0200 Arnau Bria wrote: > Ok... > > munge-devel is needed.... Even with it, it still adds disable-munge-auth... so, I can't build torque with munge support. Anyone could give me a hand? Cheers, Arnau From blake.wickliffe at aramco.com Tue Sep 13 06:01:09 2011 From: blake.wickliffe at aramco.com (Wickliffe, Blake W) Date: Tue, 13 Sep 2011 12:01:09 +0000 Subject: [torqueusers] Strange problem in Torque 2.5.7+ In-Reply-To: <20110912161300.1de4fe50@amarrosa.pic.es> References: <09D3C16B37878C44837F749DB16ACF19054BAE@DHA00730-MSXP03.aramco.com> <20110912161300.1de4fe50@amarrosa.pic.es> Message-ID: <09D3C16B37878C44837F749DB16ACF1905DBB1@DHA00730-MSXP03.aramco.com> Howdy, Hmm....that sounds likely. I'll wait to hear from Adaptive whether or not they recommend this solution or some sort of patch, like was also suggested. Thanks! Blake Wickliffe Saudi Aramco ENOD/CSYS/USG HPC Team (873-4417) -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Arnau Bria Sent: Monday, September 12, 2011 5:13 PM To: torqueusers at supercluster.org Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+ Hi, I had similar issue with older version (2.5.6) Could it be the tcp-retry-limit ? e - Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on pbs_server. 
We recommend --with-tcp-retry-limit=2 (backported from 3.0.1)

Cheers,
Arnau

From andre.gemuend at scai.fraunhofer.de Tue Sep 13 06:54:43 2011
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Tue, 13 Sep 2011 14:54:43 +0200 (CEST)
Subject: [torqueusers] File staging syntax
In-Reply-To: <049df4e4-f6e1-4be6-bd67-b57df93df0e3@mail>
Message-ID:

Hello Ken,

you just need two stagein or stageout files on one line to trigger it:

[andre@gloria pbs]$ cat pbstest
#!/bin/bash
#PBS -S /bin/bash
#PBS -q local
#PBS -W stagein=foo@gloria.d-grid.scai.fraunhofer.de:/tmp/foo,stagein=foo2@gloria.d-grid.scai.fraunhofer.de:/tmp/foo2
#PBS -m n
echo "foo"
[andre@gloria pbs]$ qsub pbstest
qsub: illegal -W value

[andre@gloria pbs]$ cat pbstest
#!/bin/bash
#PBS -S /bin/bash
#PBS -q local
#PBS -W stagein=foo@gloria.d-grid.scai.fraunhofer.de:/tmp/foo,foo2@gloria.d-grid.scai.fraunhofer.de:/tmp/foo2
#PBS -m n
echo "foo"
[andre@gloria pbs]$ qsub pbstest
qsub: illegal -W value

[andre@gloria pbs]$ cat pbstest
#!/bin/bash
#PBS -S /bin/bash
#PBS -q local
#PBS -W stagein=foo@gloria.d-grid.scai.fraunhofer.de:/tmp/foo
#PBS -W stagein=foo2@gloria.d-grid.scai.fraunhofer.de:/tmp/foo2
#PBS -m n
echo "foo"
[andre@gloria pbs]$ qsub pbstest
435688.tonia.d-grid.scai.fraunhofer.de

So neither the old nor the new syntax for specifying multiple files per -W works anymore. gLite CREAM (http://glite.cern.ch/glite-CREAM/) generates these lines with its wrapper script. So this is basically a bug in that software (which is easy to solve), but it would have been nice to be notified of the change.

Greetings
André

> Andre
>
> Can you send your qsub or msub line?
>
> Can you send your script as well?

> > is it possible that the -W syntax changed again between 2.5.5 and
> > 2.5.8? We were using 2.5.5 without problems, but since I upgraded to
> > 2.5.8 yesterday, PBS scripts with more than one file per staging
> > line failed with "illegal -W syntax". I had to change the scripts to
> > use separate -W lines for every file. I didn't see this in the
> > changelog, or maybe I just missed it?

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From JTRUTWIN at CSBSJU.EDU Tue Sep 13 08:41:02 2011
From: JTRUTWIN at CSBSJU.EDU (Trutwin, Joshua)
Date: Tue, 13 Sep 2011 14:41:02 +0000
Subject: [torqueusers] Torque rpm build headaches
In-Reply-To:
References: <710C58696EA3BC42B425E4DBB39C1D5E25C911E9@MAIL-MBX2.ad.csbsju.edu>
Message-ID: <710C58696EA3BC42B425E4DBB39C1D5E25C93F58@MAIL-MBX2.ad.csbsju.edu>

> --disable-rpp doesn't really do much of anything (should only affect momctl
> in 2.5.8 -- possibly negatively), and really isn't advised anymore. Here is the
> default "Garrick Staples" response circa 2006

Ok I will remove it from my build. Thanks, this is why I join MLs, to get the real dirt. :)

> Garrick Staples wrote many times:
> >Not in my opinion.
The --disable-rpp option only affects Resource
> >Monitor requests (as when using momctl). Back in the OpenPBS days this
> >was more important because schedulers like Maui had to issue RM
> >requests to every node. These days pbs_server is already providing
> >that information directly to schedulers.
> >
> >Without RPP, momctl must allocate a new socket and bind to a
> >privileged port to talk to a MOM. When you talk to many MOMs quickly,
> >you can easily run out of privileged ports to TIME_WAIT.
> >
> >With RPP, momctl works with one socket on one privileged port.
>
> now I believe that the Adaptive Computing developers are working on
> removing the last vestiges of RPP from TORQUE.

I initially enabled it as I thought it would help make firewalling torque easier by forcing only TCP traffic. My clients are in a different zone than my head/compute node, but I think I'll still only need to firewall TCP 15001-4 from clients to head node. Does Torque use UDP at all?

Thanks!

Josh

From JTRUTWIN at CSBSJU.EDU Tue Sep 13 08:48:59 2011
From: JTRUTWIN at CSBSJU.EDU (Trutwin, Joshua)
Date: Tue, 13 Sep 2011 14:48:59 +0000
Subject: [torqueusers] Torque rpm build headaches
In-Reply-To: <20110912220303.GQ10643@lbl.gov>
References: <710C58696EA3BC42B425E4DBB39C1D5E25C911E9@MAIL-MBX2.ad.csbsju.edu> <20110912220303.GQ10643@lbl.gov>
Message-ID: <710C58696EA3BC42B425E4DBB39C1D5E25C93F61@MAIL-MBX2.ad.csbsju.edu>

> > Hi - I just setup torque for a single compute node (currently). I'm
> > having some trouble with the make rpm command passing all my configure
> > params to the rpmbuild.
>
> Its goal is not to pass *all* your configure parameters to the rpmbuild
> process because doing so does not make sense. At present, it only passes
> those arguments for which the spec file has conditional build support or
> macro variables in it. This includes things like DRMAA, SCP, syslog, PAM,
> and (as you saw) the server name and path.
> RPP is not currently supported, but I hope to add support for more > conditionals as time progresses. > > That said, since I don't use a lot of the conditionals torque supports, I can't > test many of them. Thus, community contributions in the way of patches > are most appreciated. :-) Thanks for the info, I can maybe help out a bit here. > > The --with-default-server was passed but not the --disable-rpp or > > --prefix=/opt/torque-2.5.8. And --disable-gui is passed in even > > though it's enabled when running the ./configure on it's own. > > Unfortunately "gui" is not one of those yet supported by "make rpm" > and is disabled by default in the spec, which explains the behavior you saw. > You're right; this does need to be fixed. Well it works with the --with gui on RPMOPTS, could maybe just add a note to the install doc? > > I can get around these by running: > > > > make RPMOPTS="--with gui --without rpp" rpm > > > > I can't seem to change the prefix this way though, so what I do here is > open torque.spec and add this line: > > > > %define _prefix /opt/torque-2.5.8 > > > > Which gets wiped out if I reconfigure. > > Did you try: > > make RPMOPTS="--with gui --without rpp --define '_prefix /opt/torque- > 2.5.8'" rpm Doh, didn't think of that simple solution. Shows how often I use RPMOPTS... > > Unfortunately with this though now it's picking a nonexistent sendmail > > so I added -with-sendmail=/usr/sbin/sendmail to configure the sendmail > > path but it still uses /opt/torque-2.5.8/sbin/sendmail so I have to > > add a symlink here after installing the rpms... > > This, too, can be specified in RPMOPTS as "--define 'sendmail_path > /usr/sbin/sendmail'" Combined with the above do you have two --defines like so? --define '_prefix /opt/torque-2.5.8' --define 'sendmail_path /usr/sbin/sendmail' This seems to work... > > Can this be simplified at all so it works like it says it should from > > the INSTALL doc please: > > > > TORQUE has built-in support for making RPMs. 
After running > > ./configure with all desired options, 'make rpm' should create a > > set of binary RPMs that match your configuration. > > > > I take "match your configuration" to mean that it'll keep all my > > original ./configure options. > > Certain things are not currently supported because I am unable to test or > haven't had time to support them all. Some are legitimate bugs. > And some things aren't supported because they aren't standard practice > and/or are generally unwise. Replicating the GNU configure prefix (default > /usr/local) to the RPM prefix (default determined by build host RPM > configuration, usually /usr) by default is one of those things that falls into > the latter category and should only be done explicitly and with sufficient > care and forethought. :-) > > It should NEVER keep "all original ./configure options" because this would > create unmaintainable messes (as has been seen previously). But hopefully > over time the RPMOPTS mechanism can become more featureful to allow > greater site-specific customizations of built packages without the previous > problems with information leakage. I guess maybe I could write a patch against the INSTALL doc to be more explicit about what options are not supported and also show the example you had with overriding the prefix/sendmail path. Thanks for your help! Josh From knielson at adaptivecomputing.com Tue Sep 13 09:02:09 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 13 Sep 2011 09:02:09 -0600 (MDT) Subject: [torqueusers] Torque rpm build headaches In-Reply-To: <710C58696EA3BC42B425E4DBB39C1D5E25C93F58@MAIL-MBX2.ad.csbsju.edu> Message-ID: ----- Original Message ----- > From: "Joshua Trutwin" > To: "Torque Users Mailing List" > Sent: Tuesday, September 13, 2011 8:41:02 AM > Subject: Re: [torqueusers] Torque rpm build headaches > I initially enabled it as I thought it would help make firewalling > torque easier forcing only TCP traffic. 
My clients are in a
> different zone than my head/compute node, but I think I'll still
> only need to firewall TCP 15001-4 from clients to head node. Does
> Torque use UDP at all?
>
> Thanks!
>
> Josh

TORQUE uses only TCP between client apps and pbs_server. But pbs_server and the compute nodes use UDP and TCP. In TORQUE 4.0 everything will be TCP.

Ken

From knielson at adaptivecomputing.com Tue Sep 13 09:08:16 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Tue, 13 Sep 2011 09:08:16 -0600 (MDT)
Subject: [torqueusers] Strange problem in Torque 2.5.7+
In-Reply-To: <4E6F1CDD.5020303@fi.muni.cz>
Message-ID: <7ea30c15-1dcc-439c-8232-ebd460d5159c@mail>

----- Original Message -----
> From: "Mgr. Šimon Tóth"
> To: "Torque Users Mailing List"
> Sent: Tuesday, September 13, 2011 3:05:33 AM
> Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+
>
> > We've been bitten by a strange problem twice now in Torque, so I thought
> > I'd check to see if anyone else has run into it. We are running Torque
> > 2.5.7 on a large-ish cluster (3000+ nodes) and the pbs_server daemon
> > hangs. All qstat or pbsnodes commands fail. The process is still in
> > memory but it drops to 0% CPU utilization.
> >
> > Restarting the pbs_server allows it to come back up for a few seconds
> > but then it hangs again. If I clear out all the jobs in the "jobs"
> > directory and restart the server it comes back up fine. The last time
> > this happened, I was able to move jobs back into the directory a few at
> > a time and keep restarting the pbs_server until I isolated the few jobs
> > that were causing the server to hang. Checking the files, all of these
> > jobs were running on two nodes that had crashed.
> >
> > So, in essence, a pbs_mom node crashed and took down the entire cluster
> > with it. As I said, we've seen this happen twice now. Has anyone else
> > seen this?
> > The issue is that the dis_tcp_wflush() function can hang for a > loooong > time. The server will wait until all data are sent, which can be > hours, > if the other side is slow enough. Also this will hang until timeouts > occur when the other side is dead. > > The patch I included is what we have done with the tcp_dis.c file. > Sorry, that the patch isn't clean, but unfortunatelly, we have a lot > of > fixes and I don't really have the time to dig one specific out. > > Specificially ignore the GSSAPI (kerberos) stuff and concentrate on > the > dis_tcp_wflush() function. The alarms are the important stuff. > > We are using 60 seconds timeouts, you should tailor that towards your > cluster, the most stuff that will get send here are qstat replies, > which > for 8000 jobs should be somewhere around 100MB of data. > > -- > Mgr. Simon Toth > I just looked at Simon's patch and it is in the right direction. If the problem you have is that the tcp calls do not return promptly then you can be waiting a long time (10800 seconds by default). Regards Ken Nielson Adaptive Computing From JTRUTWIN at CSBSJU.EDU Tue Sep 13 09:11:55 2011 From: JTRUTWIN at CSBSJU.EDU (Trutwin, Joshua) Date: Tue, 13 Sep 2011 15:11:55 +0000 Subject: [torqueusers] Torque rpm build headaches In-Reply-To: References: <710C58696EA3BC42B425E4DBB39C1D5E25C93F58@MAIL-MBX2.ad.csbsju.edu> Message-ID: <710C58696EA3BC42B425E4DBB39C1D5E25C94070@MAIL-MBX2.ad.csbsju.edu> > TORQUE uses only TCP between client apps and pbs_server. But pbs_server > and the compute nodes use UDP and TCP. In TORQUE 4.0 everything will be > TCP. Thanks Ken - will it also do my taxes? I'm new to torque and really impressed, pretty spiffy and not the nightmare to setup as I was led to believe. 
Josh From charles.johnson at accre.vanderbilt.edu Tue Sep 13 11:55:09 2011 From: charles.johnson at accre.vanderbilt.edu (Charles Johnson) Date: Tue, 13 Sep 2011 12:55:09 -0500 Subject: [torqueusers] Strange problem in Torque 2.5.7+ In-Reply-To: <7ea30c15-1dcc-439c-8232-ebd460d5159c@mail> References: <7ea30c15-1dcc-439c-8232-ebd460d5159c@mail> Message-ID: On Sep 13, 2011, at 10:08 AM, Ken Nielson wrote: > I just looked at Simon's patch and it is in the right direction. If the problem you have is that the tcp calls do not return promptly then you can be waiting a long time (10800 seconds by default). > > Regards > > Ken Nielson > Adaptive Computing FYI, we had a stoppage in the late morning, and the last 6 minutes of the torque log file was sent. I will have a look at the patch. ~Charles~ -- Charles Johnson, Vanderbilt University Advanced Computing Center for Research & Education From mej at lbl.gov Tue Sep 13 19:58:26 2011 From: mej at lbl.gov (Michael Jennings) Date: Tue, 13 Sep 2011 18:58:26 -0700 Subject: [torqueusers] building torque+munge rpm In-Reply-To: <20110913113159.442a7d04@amarrosa.pic.es> References: <20110913113159.442a7d04@amarrosa.pic.es> Message-ID: <20110914015824.GS10643@lbl.gov> On Tuesday, 13 September 2011, at 11:31:59 (+0200), Arnau Bria wrote: > I'm building torque 2.5.7 with munge. 
I use the command: > > rpmbuild -ta --define 'prefix /usr' --define > 'torque_home /var/spool/pbs' --define 'acflags --enable-munge-auth > --enable-maxdefault --with-readline --with-tcp-retry-limit=2 > --with-rcp=scp --with-default-server=pbs03.pic.es ' torque-2.5.7.tar.gz > > > but, when I see the configure that it does I see a strange > --disable-munge-auth on it: > > + ./configure --host=x86_64-redhat-linux-gnu --build=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --includedir=/usr/include/torque --with-default-server=localhost --with-server-home=/var/spool/pbs --with-sendmail=/usr/sbin/sendmail --disable-dependency-tracking --disable-gui --without-tcl --with-rcp=scp --enable-syslog --disable-gcc-warnings --disable-munge-auth --without-pam --disable-drmaa --enable-high-availability --disable-qsub-keep-override --disable-blcr --disable-cpuset --enable-spool --enable-server-xml --enable-munge-auth --enable-maxdefault --with-readline --with-tcp-retry-limit=2 --with-rcp=scp --with-default-server=pbs03.pic.es > > > I think this is not ok cause later, torque works without munge > started... > > > Is there any problem with that version and munge? or with my command? Since --enable-munge-auth is specified after --disable-munge-auth, it should still be enabled for the build (unless something else went wrong), but your syntax is incorrect. Support for "munge" is one of the conditional build options in the spec file. To turn it on, you need to use "--with munge" on your rpmbuild command line. Don't just add it to acflags. Same goes for all the other conditionals. 
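As a rough illustration of the point about flag order above: configure processes its --enable-X/--disable-X options in the order given, so the last occurrence decides the setting. The sketch below only models that rule; effective_flag is a made-up helper name, not part of TORQUE, rpmbuild, or the spec file.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): report the effective state of
# an --enable-X / --disable-X feature, assuming the last occurrence on the
# command line wins, as it does when configure handles its arguments in order.
effective_flag() {
    feature=$1
    shift
    state=unset
    for arg in "$@"; do
        case $arg in
            "--enable-$feature")  state=enabled ;;
            "--disable-$feature") state=disabled ;;
        esac
    done
    echo "$state"
}

# Same relative order as in the configure line quoted above:
# the spec's --disable-munge-auth first, the acflags --enable-munge-auth later.
effective_flag munge-auth --disable-munge-auth --without-pam --enable-munge-auth
# prints: enabled
```

This only tells you what configure was asked to do. Whether munge support was actually compiled in still has to be confirmed from the configure output captured during the rpmbuild run.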
The conditionals can be found in the spec file:

$ grep '^%bcond_' torque.spec
%bcond_with blcr
%bcond_with cpuset
%bcond_with drmaa
%bcond_with gui
%bcond_with munge
%bcond_with pam
%bcond_without ha
%bcond_without scp
%bcond_without spool
%bcond_without syslog

Any of those needs to be specified via --with or --without. If you want one that's specified as %bcond_with, use --with. If you don't want one that's specified as %bcond_without, use --without.

So your command should have been:

rpmbuild -ta --with munge --with scp \
    --define 'torque_home /var/spool/pbs' \
    --define 'torque_server pbs03.pic.es' \
    --define 'acflags --enable-maxdefault --with-readline --with-tcp-retry-limit=2' \
    torque-2.5.7.tar.gz

The default prefix for RPM is already /usr unless you've changed it on your system (which is highly ill-advised), so you shouldn't need to specify that. And RPM uses %{_prefix}, not %{prefix}, so defining "prefix" as you did had no effect anyway. :-)

As for whether or not munge support was actually compiled in, you'd need to check the configure output. Since munge support does not generate any additional files or packages, there is nothing to break the build if munge support fails. You'll have to capture the output of rpmbuild and examine it to see what problems, if any, arose with munge support.

Michael

--
Michael Jennings
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E W: 510-495-2687
MS 050C-3396 F: 510-486-8615

From mej at lbl.gov Tue Sep 13 19:59:29 2011
From: mej at lbl.gov (Michael Jennings)
Date: Tue, 13 Sep 2011 18:59:29 -0700
Subject: [torqueusers] building torque+munge rpm
In-Reply-To: <20110913115058.4fcd127b@amarrosa.pic.es>
References: <20110913113159.442a7d04@amarrosa.pic.es> <20110913115058.4fcd127b@amarrosa.pic.es>
Message-ID: <20110914015928.GT10643@lbl.gov>

On Tuesday, 13 September 2011, at 11:50:58 (+0200),
Arnau Bria wrote:

> Ok...
>
> munge-devel is needed....
>
> maybe a warning needed???
If "--with munge" is specified, munge-devel will be added as a build-time dependency, and the build will fail if munge-devel is not installed. Adding it directly to acflags bypasses this mechanism. Michael -- Michael Jennings Linux Systems and Cluster Engineer High-Performance Computing Services Bldg 50B-3209E W: 510-495-2687 MS 050C-3396 F: 510-486-8615 From mej at lbl.gov Tue Sep 13 20:13:59 2011 From: mej at lbl.gov (Michael Jennings) Date: Tue, 13 Sep 2011 19:13:59 -0700 Subject: [torqueusers] Torque rpm build headaches In-Reply-To: <710C58696EA3BC42B425E4DBB39C1D5E25C93F61@MAIL-MBX2.ad.csbsju.edu> References: <710C58696EA3BC42B425E4DBB39C1D5E25C911E9@MAIL-MBX2.ad.csbsju.edu> <20110912220303.GQ10643@lbl.gov> <710C58696EA3BC42B425E4DBB39C1D5E25C93F61@MAIL-MBX2.ad.csbsju.edu> Message-ID: <20110914021358.GU10643@lbl.gov> On Tuesday, 13 September 2011, at 14:48:59 (+0000), Trutwin, Joshua wrote: > Thanks for the info, I can maybe help out a bit here. Yay! :-) > Well it works with the --with gui on RPMOPTS, could maybe just add a > note to the install doc? Sure, but I think it would be better to fix it the Right Way(tm). > Combined with the above do you have two --defines like so? > > --define '_prefix /opt/torque-2.5.8' --define 'sendmail_path /usr/sbin/sendmail' Yes, exactly. --define simply sets a macro, and all macros in the spec file that are user-customizable are defined conditionally to preserve prior definitions. > I guess maybe I could write a patch against the INSTALL doc to be > more explicit about what options are not supported and also show the > example you had with overriding the prefix/sendmail path. I'd strongly caution against changing the prefix or encouraging others to do so in a published document. As a general rule, you can only have 1 package of a particular name installed at once. (Yes, there are exceptions.) 
If you use "rpm -Uvh" or "yum install/upgrade" on a package, the default behavior is to replace that package, even if there are no conflicts. Specifying an alternate prefix implies that multiple versions can be installed simultaneously which, while technically true (if one knows how), can be misleading and hazardous to the unsuspecting admin (who might not know how). This is particularly the case if the version number is included in the prefix.

If a particular site wants to maintain packages that way, that's certainly their choice, and there are very good reasons for doing so. But there are also risks, and without taking the time to delineate all those risks, it's better to just say "Don't do this unless you really, really mean it" and leave it at that. :-)

Michael

--
Michael Jennings
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E W: 510-495-2687
MS 050C-3396 F: 510-486-8615

From arnaubria at pic.es Wed Sep 14 03:18:43 2011
From: arnaubria at pic.es (Arnau Bria)
Date: Wed, 14 Sep 2011 11:18:43 +0200
Subject: [torqueusers] building torque+munge rpm
In-Reply-To: <20110914015824.GS10643@lbl.gov>
References: <20110913113159.442a7d04@amarrosa.pic.es> <20110914015824.GS10643@lbl.gov>
Message-ID: <20110914111843.1d9dbbdf@amarrosa.pic.es>

On Tue, 13 Sep 2011 18:58:26 -0700
Michael Jennings wrote:

Hi Michael,

Many thanks for your detailed reply.

Cheers,
Arnau

From dave.zarnoch at sykes.com Wed Sep 14 07:41:37 2011
From: dave.zarnoch at sykes.com (Zarnoch, Dave)
Date: Wed, 14 Sep 2011 09:41:37 -0400
Subject: [torqueusers] Batch not running
Message-ID: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com>

Hello folks,

New to Torque, used to run NQS....

Concerning Torque...

I have a small script:

$ more dn_test.sh
#!/bin/sh
#
PATH=/bin:/usr/bin:/usr/local/bin:/etc:/usr/sbin:/usr/ucb:$HOME/bin:/usr/bin/X11:/sbin:.
export PATH
DATE=`date +%H%M`
echo "Hello"
touch /tmp/dn_test_${DATE}
sleep 90

When I submit the script:

qsub -V -l nodes=1 -q dn dn_test.sh

It runs fine.

But I need to run batch...

I created a text file "dn_test.txt" that contains:

/home/zarnocda/torque/scripts_test/dn_test.sh

When I run:

qsub -V -l nodes=1 -q dn dn_test.txt

It appears to process the file:

qstat -s
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
7592.usphl1ora002.amer    dn_test.txt      zarnocda               0 R dn

But it doesn't execute the script within:

/home/zarnocda/torque/scripts_test/dn_test.sh

Any help!

Thanks!

Dave

Dave Zarnoch
UNIX Systems Administration
(215)200-0911
Dave.Zarnoch at sykes.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110914/6cd856cc/attachment.html

From jjc at iastate.edu Wed Sep 14 09:59:56 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Wed, 14 Sep 2011 10:59:56 -0500
Subject: [torqueusers] Batch not running : Things to check
In-Reply-To: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com>
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com>
Message-ID:

Dave,

Welcome to Torque. I switched from NQS some time ago, and Torque/PBS has been a good replacement for me.

Things to check:

What does the error output say? ( Probably in file dn_test.txt.e[0-9]* )

Permissions on /home/zarnocda/torque/scripts_test/dn_test.sh , is it executable? You may need:

chmod u+x /home/zarnocda/torque/scripts_test/dn_test.sh

I'd also check if /home/zarnocda/torque/scripts_test/dn_test.sh even exists on the compute node. e.g.

ls /home/zarnocda/torque/scripts_test/dn_test.sh executable

I usually use the interactive option ( qsub -I ) to debug these kinds of problems.
You could issue:

qsub -V -I -l nodes=1 -q dn

which will start an interactive job and log you into the mother superior node for that job, where you can then try issuing the commands within your job that are not working.

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110914/3316f928/attachment-0001.html

From dave.zarnoch at sykes.com Wed Sep 14 10:30:13 2011
From: dave.zarnoch at sykes.com (Zarnoch, Dave)
Date: Wed, 14 Sep 2011 12:30:13 -0400
Subject: [torqueusers] Batch not running : Things to check
In-Reply-To:
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com>
Message-ID: <7651D43C8FD38F458F022D03E5AF91E5FECC4C@ustpamxc005.amer.sykes.com>

Thanks James for your expedient help!

I have until 10/1 to get this working :0)

To answer your questions:

What does the error output say? ( Probably in file dn_test.txt.e[0-9]* )

Where would this output file exist?

Permissions on /home/zarnocda/torque/scripts_test/dn_test.sh , is it executable?

-rwxrwxr-x 1 zarnocda zarnocda 183 Sep 14 09:27 dn_test.sh
-rwxr-xr-x 1 zarnocda zarnocda 46 Sep 14 09:17 dn_test.txt

Located on the compute node:

/home/zarnocda/torque/scripts_test/dn_test.sh

And

/home/zarnocda/torque/scripts_test/dn_test.txt

I'll try your suggestion about interactive mode.

THANKS AGAIN!

Dave

Dave Zarnoch
UNIX Systems Administration
(215)200-0911
Dave.Zarnoch at sykes.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110914/31b72c19/attachment-0001.html

From dave.zarnoch at sykes.com Wed Sep 14 10:48:01 2011
From: dave.zarnoch at sykes.com (Zarnoch, Dave)
Date: Wed, 14 Sep 2011 12:48:01 -0400
Subject: [torqueusers] Batch not running : Things to check
In-Reply-To:
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com>
Message-ID: <7651D43C8FD38F458F022D03E5AF91E5FECC6F@ustpamxc005.amer.sykes.com>

James,

I tried entering:

qsub -V -I -l nodes=1 -q dn

and it just hangs there

Do I have a problem with "mom"?

Here's some files in mom_priv:

usphl1ora002@/var/spool/torque/mom_priv>ls -l jobs
total 0

usphl1ora002@/var/spool/torque/mom_priv>more config
$pbsserver usphl1ora002.amer.sykes.com # note: hostname running pbs_server
$logevent 255 # bitmap of which events to log

usphl1ora002@/var/spool/torque/mom_priv>more mom.lock
25994

usphl1ora002@/var/spool/torque/mom_priv>ps -ef | grep 25994 | grep -v grep
root 25994 1 0 Sep12 ? 00:01:03 /usr/local/sbin/pbs_mom -p

Not really familiar with "mom"

I also don't have a lot of documentation on Torque...

Do you know of any good web pages?

Thanks!

Dave

Dave Zarnoch
UNIX Systems Administration
(215)200-0911
Dave.Zarnoch at sykes.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110914/574d2a2c/attachment-0001.html

From sm4082 at nyu.edu Wed Sep 14 11:10:55 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Wed, 14 Sep 2011 13:10:55 -0400
Subject: [torqueusers] Batch not running : Things to check
In-Reply-To: <7651D43C8FD38F458F022D03E5AF91E5FECC6F@ustpamxc005.amer.sykes.com>
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com> <7651D43C8FD38F458F022D03E5AF91E5FECC6F@ustpamxc005.amer.sykes.com>
Message-ID:

Hi Dave,

This is what I have in my config file.

$pbsserver crunch.local
$usecp crunch.its.nyu.edu:/home /home
$spool_as_final_name true

I think you need to mention the second line.

Best,
Sreedhar.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110914/62353063/attachment-0001.html

From dave.zarnoch at sykes.com Wed Sep 14 11:18:58 2011
From: dave.zarnoch at sykes.com (Zarnoch, Dave)
Date: Wed, 14 Sep 2011 13:18:58 -0400
Subject: [torqueusers] Batch not running : Things to check
In-Reply-To:
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com><7651D43C8FD38F458F022D03E5AF91E5FECC6F@ustpamxc005.amer.sykes.com>
Message-ID: <7651D43C8FD38F458F022D03E5AF91E5FECC9A@ustpamxc005.amer.sykes.com>

Sreedhar.

Thanks for your suggestion!

Just a question....

The second line:

$usecp crunch.its.nyu.edu:/home /home

Is this because the script that I'm running is located in /home or is the location "/home" used for something else?

Thanks!

Dave

Dave Zarnoch
UNIX Systems Administration
(215)200-0911
Dave.Zarnoch at sykes.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110914/c26b9e30/attachment-0001.html

From sm4082 at nyu.edu Wed Sep 14 11:24:56 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Wed, 14 Sep 2011 13:24:56 -0400
Subject: [torqueusers] Batch not running : Things to check
In-Reply-To: <7651D43C8FD38F458F022D03E5AF91E5FECC9A@ustpamxc005.amer.sykes.com>
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com><7651D43C8FD38F458F022D03E5AF91E5FECC6F@ustpamxc005.amer.sykes.com> <7651D43C8FD38F458F022D03E5AF91E5FECC9A@ustpamxc005.amer.sykes.com>
Message-ID: <5822801B-6D44-4D63-93A9-C640078E75E0@nyu.edu>

Hi Dave,

This line says which directories from the host should be staged on to the compute node's destination directory. This is what I found from torque documentation. I included the link to the page below. Hopefully, this helps.

$usecp
Format: <HOST>:<SRCDIR> <DSTDIR>
Description: Specifies which directories should be staged (see TORQUE Data Management)
Example: $usecp *.fte.com:/data /usr/local/data

http://www.clusterresources.com/torquedocs21/a.cmomconfig.shtml

Best,
Sreedhar.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110914/74a4a532/attachment-0001.html

From gus at ldeo.columbia.edu Wed Sep 14 11:29:50 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 14 Sep 2011 13:29:50 -0400
Subject: [torqueusers] Batch not running
In-Reply-To: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com>
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com>
Message-ID: <4E70E48E.1060206@ldeo.columbia.edu>

Hi Dave

Did you make dn_test.sh executable?
(chmod u+x dn_test.sh)

Did you enable scheduling?
(qmgr -c 'set server scheduling = True')

Gus Correa

From dave.zarnoch at sykes.com Wed Sep 14 11:31:13 2011
From: dave.zarnoch at sykes.com (Zarnoch, Dave)
Date: Wed, 14 Sep 2011 13:31:13 -0400
Subject: [torqueusers] Batch not running : Things to check
In-Reply-To: <5822801B-6D44-4D63-93A9-C640078E75E0@nyu.edu>
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com><7651D43C8FD38F458F022D03E5AF91E5FECC6F@ustpamxc005.amer.sykes.com><7651D43C8FD38F458F022D03E5AF91E5FECC9A@ustpamxc005.amer.sykes.com> <5822801B-6D44-4D63-93A9-C640078E75E0@nyu.edu>
Message-ID: <7651D43C8FD38F458F022D03E5AF91E5FECCAC@ustpamxc005.amer.sykes.com>

Thanks!

I'll give that a shot!

Dave

Dave Zarnoch
UNIX Systems Administration
(215)200-0911
Dave.Zarnoch at sykes.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110914/d0b8b40a/attachment-0001.html

From dave.zarnoch at sykes.com Wed Sep 14 11:35:25 2011
From: dave.zarnoch at sykes.com (Zarnoch, Dave)
Date: Wed, 14 Sep 2011 13:35:25 -0400
Subject: [torqueusers] Batch not running
In-Reply-To: <4E70E48E.1060206@ldeo.columbia.edu>
References: <7651D43C8FD38F458F022D03E5AF91E5F95B06@ustpamxc005.amer.sykes.com> <4E70E48E.1060206@ldeo.columbia.edu>
Message-ID: <7651D43C8FD38F458F022D03E5AF91E5FECCB3@ustpamxc005.amer.sykes.com>

Yes, the script is executable.

I just issued the command you suggested.

Thanks again!

You guys are GREAT!

Dave

Dave Zarnoch
UNIX Systems Administration
(215)200-0911
Dave.Zarnoch at sykes.com
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From vlad at cosy.sbg.ac.at Wed Sep 14 12:08:54 2011
From: vlad at cosy.sbg.ac.at (vlad at cosy.sbg.ac.at)
Date: Wed, 14 Sep 2011 20:08:54 +0200
Subject: [torqueusers] Job distributon does not what it is suppossed to do
Message-ID:

Hi!

I'm using torque version 3.0.3-snap.201107121616.
I have set up several queues on a cluster with nodes containing GPUs and other nodes with Opteron CPUs.

I have assigned the property "gpunode" to every node containing the Nvidia GPUs and "opteron" to every node with our Magny Cours Opterons (which lack any GPUs). (Manual of torque subsection 4.1.4)

One of my queues is called gpushort, the corresponding other one is optshort.
Jobs should be directed to the GPU nodes when queued into gpushort, and to the Opteron nodes when queued into "optshort".

I'm now using Maui as scheduler, but I have also tried pbs_sched for a short time, with the same result.
This is my output of pbsnodes:

gpu01
     state = free
     np = 8
     properties = i7,i7-new,gpunode,16G
     ntype = cluster
     status = rectime=1316029218,varattr=,jobs=,state=free,netload=36006443357,gres=,loadave=4.00,ncpus=8,physmem=16315316kb,availmem=44122752kb,totmem=49083308kb,idletime=9431,nusers=1,nsessions=2,sessions=5046 32314,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2
     gpu_status = gpu[1]=gpu_id=0000:06:00.0;,gpu[0]=gpu_id=0000:05:00.0;,driver_ver=280.13,timestamp=Wed Sep 14 19:45:19 2011

gpu02
     state = free
     np = 8
     properties = i7,12G,gpunode
     ntype = cluster
     status = rectime=1316029233,varattr=,jobs=,state=free,netload=59511138356,gres=,loadave=3.99,ncpus=8,physmem=12187556kb,availmem=40056024kb,totmem=44955548kb,idletime=10142,nusers=0,nsessions=0,uname=Linux gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2
     gpu_status = gpu[1]=gpu_id=0000:08:00.0;,gpu[0]=gpu_id=0000:07:00.0;,driver_ver=280.13,timestamp=Wed Sep 14 15:41:53 2011

gpu03
     state = free
     np = 8
     properties = fermi,12G,gpunode,i7
     ntype = cluster
     status = rectime=1316029202,varattr=,jobs=,state=free,netload=4100691397,gres=,loadave=4.00,ncpus=8,physmem=12189608kb,availmem=41308600kb,totmem=44957600kb,idletime=7941,nusers=0,nsessions=0,uname=Linux gpu03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1
     gpu_status = gpu[0]=gpu_id=0000:02:00.0;,driver_ver=280.13,timestamp=Wed Sep 14 15:43:29 2011

gpu04
     state = free
     np = 8
     properties = i7,gpunode,12G
     ntype = cluster
     status = rectime=1316029210,varattr=,jobs=,state=free,netload=39234422480,gres=,loadave=4.00,ncpus=8,physmem=12187556kb,availmem=40432932kb,totmem=44955548kb,idletime=463010,nusers=0,nsessions=0,uname=Linux gpu04 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
     gpu_status = driver_ver=UNKNOWN,timestamp=Wed Sep 14 21:46:43 2011

... (and so on until gpu07..)
.. (and my Opterons, further below ..)

hex03
     state = free
     np = 14
     properties = opteron
     ntype = cluster
     status = rectime=1316029213,varattr=,jobs=,state=free,netload=22449278232,gres=,loadave=0.04,ncpus=16,physmem=32877076kb,availmem=97422188kb,totmem=98413068kb,idletime=7007,nusers=0,nsessions=0,uname=Linux hex03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

hex04
     state = free
     np = 14
     properties = opteron
     ntype = cluster
     status = rectime=1316029218,varattr=,jobs=,state=free,netload=72995554028,gres=,loadave=0.03,ncpus=16,physmem=32876308kb,availmem=83822708kb,totmem=98412300kb,idletime=7106,nusers=0,nsessions=0,uname=Linux hex04 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

hex05
     state = free
     np = 14
     properties = opteron
     ntype = cluster
     status = rectime=1316029221,varattr=,jobs=,state=free,netload=101419420599,gres=,loadave=0.00,ncpus=16,physmem=32876308kb,availmem=83854984kb,totmem=98412300kb,idletime=791803,nusers=0,nsessions=0,uname=Linux hex05 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

... (until we get to hex14 ...)

hex14
     state = free
     np = 14
     properties = opteron
     ntype = cluster
     status = rectime=1316030058,varattr=,jobs=,state=free,netload=24497857045,gres=,loadave=0.09,ncpus=16,physmem=32876308kb,availmem=83878088kb,totmem=98412300kb,idletime=706625,nusers=0,nsessions=0,uname=Linux hex14 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

the configuration of the qmgr for the 2 queues:

...
#
# Create and define queue gpushort
#
create queue gpushort
set queue gpushort queue_type = Execution
set queue gpushort resources_default.neednodes = gpunode
set queue gpushort resources_default.nodes = 1
set queue gpushort resources_default.walltime = 24:00:00
set queue gpushort enabled = True
set queue gpushort started = True
#
# Create and define queue optshort
#
create queue optshort
set queue optshort queue_type = Execution
set queue optshort resources_default.neednodes = opteron
set queue optshort resources_default.nodes = 1
set queue optshort resources_default.walltime = 24:00:00
set queue optshort enabled = True
set queue optshort started = True
#
...

Now, if you submit jobs to gpushort, they get executed on the gpunodes (as they should be). If you submit jobs to optshort, they are supposed to be executed on the Opterons, but instead they end up running on the first gpunode (gpu01) as well.

How can I change this bad behaviour?

I'm clueless...

Any help appreciated.

Greetings from Salzburg/Austria/Europe

Vlad Popa

University of Salzburg
Computer Science / HPC Computing
5020 Salzburg
Austria
Europe

From knielson at adaptivecomputing.com Wed Sep 14 14:39:39 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 14 Sep 2011 14:39:39 -0600 (MDT)
Subject: [torqueusers] Job distributon does not what it is suppossed to do
In-Reply-To:
Message-ID: <83241193-1cc1-4d03-81a8-754886bd6c33@mail>

----- Original Message -----
> From: vlad at cosy.sbg.ac.at
> To: torqueusers at supercluster.org
> Sent: Wednesday, September 14, 2011 12:08:54 PM
> Subject: [torqueusers] Job distributon does not what it is suppossed to do
>
> [...]
>
> Now, if you submit jobs to gpushort, they get executed on the
> gpunodes (as they should be). If you submit jobs to optshort,
> they are supposed to be executed on the Opterons, but instead
> they end up running on the first gpunode (gpu01) as well.
>
> How can I change this bad behaviour?

We need someone to modify Maui to support GPUs. pbs_sched also does not support GPUs currently. Currently, only Moab knows about GPUs at the scheduler level.

Ken Nielson
Adaptive Computing

From soubari at yahoo.com Wed Sep 14 21:04:41 2011
From: soubari at yahoo.com (sam oubari)
Date: Wed, 14 Sep 2011 20:04:41 -0700 (PDT)
Subject: [torqueusers] Help! One Puzzle At a Time... * update#2 *
In-Reply-To: <1315838724.22667.YahooMailNeo@web110602.mail.gq1.yahoo.com>
References: <1315838724.22667.YahooMailNeo@web110602.mail.gq1.yahoo.com>
Message-ID: <1316055881.47486.YahooMailNeo@web110609.mail.gq1.yahoo.com>

Hi,

I am using 2.5.6 with pbs_sched, and everything is running 'local'. I am still having problems, and here is a recap:

1) A repeating job (it re-qsubs a static script at the end of each run, to re-launch in 10 or 30 minutes) gets stuck in state Q a couple of times a week. In server_logs, there is an odd coinciding entry:

09/09/2011 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified at request of rpt_prod at naboo.linnbenton.edu

qstat shows Hold_Types changing from n to o.

2) MOM dies about once a week. Clues from /var/log/messages:

Sep 14 11:33:22 naboo kernel: pbs_mom[26533]: segfault at 0000790100007868 rip 000000000043136b rsp 00007fff898a3e80 error 4

I got this after the restart:

Sep 14 11:41:20 naboo pbs_mom: LOG_ERROR::Invalid argument (22) in rm_request, write string failed Supporting protocol failure message refused from port 1021 addr 127.0.0.1

Sometimes, I get:

Sep 13 08:29:29 naboo pbs_mom: LOG_ALERT::mom_server_valid_message_source, bad connect from 127.0.0.1:1022 - unauthorized server

I am running Red Hat 5.6 64-bit. We have 4 queues (max_running = 1), and we average about 1000 qsubs per day (mostly small jobs, 1 minute or less). When we were on 2.4.11, MOM ran much better. I am running out of ideas, so if you have a similar environment that works, I would love to see your settings. For example, what options did you 'configure' with?

Thank you,
Sam.

From knielson at adaptivecomputing.com Wed Sep 14 21:18:18 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 14 Sep 2011 21:18:18 -0600 (MDT)
Subject: [torqueusers] Help! One Puzzle At a Time... * update#2 *
In-Reply-To: <1316055881.47486.YahooMailNeo@web110609.mail.gq1.yahoo.com>
Message-ID: <6254e42d-d6c3-4c96-a72f-4de7f618ecd0@mail>

----- Original Message -----
> From: "sam oubari"
> To: torqueusers at supercluster.org
> Sent: Wednesday, September 14, 2011 9:04:41 PM
> Subject: Re: [torqueusers] Help! One Puzzle At a Time... * update#2 *
>
> [...]
>
> 2) MOM dies about once a week, clues from /var/log/messages:
>
> Sep 14 11:33:22 naboo kernel: pbs_mom[26533]: segfault at
> 0000790100007868 rip 000000000043136b rsp 00007fff898a3e80 error 4
>
> [...]

Sam,

Have you tried configuring TORQUE using --with-debug and then starting the MOM with gdb to see where the segfault occurs?

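As a complement to the gdb route, the kernel line quoted above already records the faulting instruction pointer (rip). The following is only a minimal sketch, assuming Python 3 and a log line in exactly the format shown in Sam's message; the helper name parse_segfault is made up for illustration and is not part of TORQUE. It extracts the addresses and decodes the low three bits of the x86 page-fault error code.

```python
import re

# Matches the x86-64 kernel "segfault at ... rip ... rsp ... error N" format
# as it appears in the /var/log/messages excerpt above (an assumption: other
# kernel versions print "ip"/"sp" instead and would need a different pattern).
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)


def parse_segfault(line):
    """Return the fields of a kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"))
    return {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "rsp": int(m.group("rsp"), 16),
        # x86 page-fault error code bits: 1 = page present,
        # 2 = write access, 4 = fault raised in user mode.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = ("Sep 14 11:33:22 naboo kernel: pbs_mom[26533]: segfault at "
        "0000790100007868 rip 000000000043136b rsp 00007fff898a3e80 error 4")
info = parse_segfault(line)
print(info["prog"], hex(info["rip"]))  # pbs_mom 0x43136b
```

For the line above, error 4 decodes to a user-mode read of an unmapped page. The rip value could then be handed to a tool such as addr2line (for example, addr2line -f -e /path/to/pbs_mom 0x43136b), assuming the binary was built with debug symbols and is not position-independent; the path is a placeholder.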
Ken

From vlad at cosy.sbg.ac.at Wed Sep 14 23:52:49 2011
From: vlad at cosy.sbg.ac.at (Vlad Popa)
Date: Thu, 15 Sep 2011 07:52:49 +0200
Subject: [torqueusers] Job distributon does not what it is suppossed to do
In-Reply-To: <83241193-1cc1-4d03-81a8-754886bd6c33@mail>
References: <83241193-1cc1-4d03-81a8-754886bd6c33@mail>
Message-ID: <4E7192B1.8080902@cosy.sbg.ac.at>

On 2011-09-14 22:39, Ken Nielson wrote:
> ----- Original Message -----
>> From: vlad at cosy.sbg.ac.at
>> To: torqueusers at supercluster.org
>> Sent: Wednesday, September 14, 2011 12:08:54 PM
>> Subject: [torqueusers] Job distributon does not what it is suppossed to do
>>
>> [...]
>>
>> Now, if you submit jobs to gpushort, they get executed on the
>> gpunodes (as they should be). If you submit jobs to optshort,
>> they are supposed to be executed on the Opterons, but instead
>> they end up running on the first gpunode (gpu01) as well.
>>
>> How can I change this bad behaviour?
>
> We need someone to modify Maui to support GPUs. pbs_sched also does not support GPUs currently. Currently, only Moab knows about GPUs at the scheduler level.

Yes, that may be, but my jobs in the queues are still not directed to the nodes with the right properties. I don't think that would change if I chose different property names.

From shahsaifi at gmail.com Thu Sep 15 09:55:06 2011
From: shahsaifi at gmail.com (Shahnawaz Saifi)
Date: Thu, 15 Sep 2011 15:55:06 +0000 (UTC)
Subject: [torqueusers] Invitation to connect on LinkedIn
Message-ID: <161582094.2319429.1316102106036.JavaMail.app@ela4-bed82.prod>

I'd like to add you to my professional network on LinkedIn.

- Shahnawaz

Shahnawaz Saifi
Systems Engineer at Clickable
New Delhi Area, India

Confirm that you know Shahnawaz Saifi:
https://www.linkedin.com/e/-p5p7l5-gslx7oaa-4a/isd/4223591488/gxTpf4CZ/?hs=false&tok=2_i5J7bnyXIQU1

--
You are receiving Invitation to Connect emails. Click to unsubscribe:
http://www.linkedin.com/e/-p5p7l5-gslx7oaa-4a/kQ-ZaN55HywCKNWxkMlnSwpebDvnCdMfbK-ZUhfHswD/goo/torqueusers%40supercluster%2Eorg/20061/I1459216576_1/?hs=false&tok=1Us1ke7CyXIQU1

(c) 2011 LinkedIn Corporation. 2029 Stierlin Ct, Mountain View, CA 94043, USA.

From jjc at iastate.edu Thu Sep 15 14:29:54 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 15 Sep 2011 15:29:54 -0500
Subject: [torqueusers] Job distributon does not what it is suppossed to do
In-Reply-To: <4E7192B1.8080902@cosy.sbg.ac.at>
References: <83241193-1cc1-4d03-81a8-754886bd6c33@mail> <4E7192B1.8080902@cosy.sbg.ac.at>
Message-ID:

Vlad Popa,

I don't have users submit to a specific queue. I just have them specify the resources they need and let a routing queue decide which queue to run them in. You can do this with Maui or just plain old pbs_sched.

If your users specify queues, you would have a qsub something like:

qsub -q optshort -lnodes=2:ppn=16,walltime=1:00,vmem=32GB,pmem=2GB,mem=32GB ./script

If they specify resources, this would be like:

qsub -lnodes=2:ppn=16:opteron,walltime=1:00,vmem=32GB,pmem=3GB,mem=32GB ./script

I let the default queue be a routing queue:

set server default_queue = routing_queue
set queue routing_queue queue_type = Route

Set up routing from it into 5 queues:

set queue routing_queue route_destinations = optshort
set queue routing_queue route_destinations += gpushort
set queue routing_queue route_destinations += medium
set queue routing_queue route_destinations += large_short
set queue routing_queue route_destinations += large

And set all 5 queues to be from_route_only:

set queue large_short from_route_only = True
set queue large from_route_only = True
set queue medium from_route_only = True
set queue optshort from_route_only = True
set queue gpushort from_route_only = True

Then each job traverses the list in order until it reaches a queue that can satisfy all of its resource requirements, including :opteron or :gpunode.

I created a web page for my users. In this case I'd simply have radio buttons for

need gpus? no/yes
need opterons? no/yes
need I7? no/yes

Then the web page could generate the correct #PBS line.

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Vlad Popa
>Sent: Thursday, September 15, 2011 12:53 AM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] Job distributon does not what it is suppossed to do
>
>On 2011-09-14 22:39, Ken Nielson wrote:
>> [...]
>> We need someone to modify Maui to support GPUs. pbs_sched also
>> does not support GPUs currently. Currently, only Moab knows about
>> GPUs at the scheduler level.
>Yes, might be, but still my jobs in the queues are not directed to
>the right "property-nodes". I don't think, it would change, if I chose
>different property names.

From Eirikur.Hjartarson at decode.is Fri Sep 16 03:57:13 2011
From: Eirikur.Hjartarson at decode.is (Eiríkur Hjartarson)
Date: Fri, 16 Sep 2011 09:57:13 +0000
Subject: [torqueusers] Two problems with a routing queue
Message-ID: <4C3D9F9382FE07458AF8E79B234B4BD8AD917E@smbx.decode.is>

Hi,

I'm resubmitting these questions since I got no replies to them one week ago.

In order to limit the number of jobs that maui considers for scheduling, we have a routing queue setup,

#
# Create and define queue exec
#
create queue exec
set queue exec queue_type = Route
set queue exec route_destinations = real_exec
set queue exec route_held_jobs = False
set queue exec enabled = True
set queue exec started = True
#
# Create and define queue real_exec
#
create queue real_exec
set queue real_exec queue_type = Execution
set queue real_exec max_user_queuable = 800
set queue real_exec from_route_only = True
set queue real_exec resources_default.nodes = 1
set queue real_exec enabled = True
set queue real_exec started = True

(800 is a bit higher than the number of CPUs in the cluster)

There are two problems that we have experienced with this setup.

1. Suppose a job (id: 28379062) is still on the "exec" queue and depends on another job (id: 28379059) that finishes *before* the first job is put on the "real_exec" queue. When the dependent job (id: 28379062) is then transferred to the "real_exec" queue, it generates the following error mail.

---
PBS Job Id: 28379062.lpbs2.decode.is
Job Name: bambino_22892
Aborted by PBS Server
Dependency request for job rejected by 28379059.lpbs2.decode.is
Unknown Job Id
Job held for unknown job dep, use 'qrls' to release
---

Is there any way to solve this problem, other than setting the keep_completed attribute to some non-zero value? The problem with the keep_completed attribute is that we (think we) have to set it to a big value, say, one day.

2. The "real_exec" queue may get filled up with jobs that all depend on a job that is still on the "exec" queue. It seems possible to me that the route_held_jobs attribute only applies to user holds. If that is correct, would it be possible to let it also apply to system holds?

Regards,

--
Eiríkur Hjartarson

From guilherme.consultor at gmail.com Fri Sep 16 05:34:52 2011
From: guilherme.consultor at gmail.com (Guilherme Rocha)
Date: Fri, 16 Sep 2011 08:34:52 -0300
Subject: [torqueusers] Success setting up a new Torque Environment in University
In-Reply-To:
References:
Message-ID:

Hello folks,

Thanks a lot for all the answers. I'm troubleshooting now, taking your answers into account, in order to reach my goal. Thanks a lot once more.

I have already started to research how to use MPI. In fact, we have clustalw-mpi installed on the head node, but not on the compute nodes (yet). I will try your recommendations now. We also have Open MPI installed. We are using Debian, with all software installed via aptitude.

I have this question from my teacher, the lab coordinator: is it possible to run X programs (like clustalx) interactively under Torque? I'm familiar with the terminal, but biologists are not.

Thanks once more, dudes.

Guilherme 2011/9/2 Coyle, James J [ITACD] > You?ll need and MPI application to use multiple nodes.**** > > ** ** > > Perhaps this application would be Clustalw-MPI .**** > > It looks like this is available at:**** > > > http://www.bii.a-star.edu.sg/achievements/applications/clustalw/download.php > **** > > Information on this can be found at:**** > > http://www.bii.a-star.edu.sg/docs/software/README.clustalw-mpi**** > > ** ** > > ** ** > > Torque just reserves node, you need something like MPi and program > written in MPI to use**** > > multiple nodes.**** > > ** ** > > If you need a suggestion on MPI, I use OpenMPI because it installs > easily and can use **** > > multiple network interconnects, it also works well with Torque.**** > > ** ** > > Ethernet works, but is slow, however, that is likely what you have.**** > > We use Infiniband and Myrinet networks in addition to Ethernet. They give > much better **** > > performance for our workloads, but the cards and switches are very > expensive.**** > > ** ** > > ** ** > > *From:* torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] *On Behalf Of *Guilherme Rocha > *Sent:* Friday, September 02, 2011 4:18 AM > *To:* torqueusers at supercluster.org > > *Subject:* [torqueusers] Success setting up a new Torque Environment in > University**** > > ** ** > > Hello folks, > > > > My name is Guilherme and this is my first post here. > Thanks for this great project. > > We're setting-up a Torque Cluster with 23 nodes and will be used to > bioinformatics tasks. > > I'm completely newbie to all of this, so after hard steps of > troubleshooting, we finally > received good news in logs. We did a small alginment using clustalw. > > But we have some doubts about how to use a program in parallel, like: > > > Question 1) I need to have clustalw (or the script programs installed in > all nodes?) > Clustalw is only installed in head node by now. 
> > Question 2: Can we use/open GUI program's interfaces to work using > torque? > > Question 3: When I submit a job, even requesting 10 nodes, clustal are > being runned > only in one node. What can be wrong? > > > thanks in advance > > thanks in advance, > > > -- > -- > Guilherme Rocha > GF7 Doc & Systems - Solu??es Tecnol?gicas > Pesquisa e Desenvolvimento - World Wide > R. Jo?o Goulart, 170 - Rio Pardo - RS - CEP 96640-000 > Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br **** > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- -- Guilherme Rocha GF7 Doc & Systems - Solu??es Tecnol?gicas Pesquisa e Desenvolvimento - World Wide R. Jo?o Goulart, 170 - Rio Pardo - RS - CEP 96640-000 Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110916/47327e4c/attachment.html From knielson at adaptivecomputing.com Fri Sep 16 07:07:19 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 16 Sep 2011 07:07:19 -0600 (MDT) Subject: [torqueusers] Two problems with a routing queue In-Reply-To: <4C3D9F9382FE07458AF8E79B234B4BD8AD917E@smbx.decode.is> Message-ID: ----- Original Message ----- > From: "Eir?kur Hjartarson" > To: torqueusers at supercluster.org > Sent: Friday, September 16, 2011 3:57:13 AM > Subject: [torqueusers] Two problems with a routing queue > > Hi, > > I'm resubmitting these questions since I got no replies to them one > week ago. 
> > In order to limit the number of jobs that maui considers for > scheduling, we have a routing queue setup, > > # > # Create and define queue exec > # > create queue exec > set queue exec queue_type = Route > set queue exec route_destinations = real_exec > set queue exec route_held_jobs = False > set queue exec enabled = True > set queue exec started = True > # > # Create and define queue real_exec > # > create queue real_exec > set queue real_exec queue_type = Execution > set queue real_exec max_user_queuable = 800 > set queue real_exec from_route_only = True > set queue real_exec resources_default.nodes = 1 > set queue real_exec enabled = True > set queue real_exec started = True > > (800 is a bit higher than the number of CPUs in the cluster) > > There are two problems that we have experienced with this setup. > > 1. > > A job (id: 28379062), that is still on the "exec" queue and depends > on another job (id: 28379059) that finishes *before* the job (id: > 28379062) is put on the "real_exec" queue will generate the > following error mail, when it (id: 28379062) is transferred to the > "real_exec" queue. > > --- > PBS Job Id: 28379062.lpbs2.decode.is > Job Name: bambino_22892 > Aborted by PBS Server > Dependency request for job rejected by 28379059.lpbs2.decode.is > Unknown Job Id Job held for unknown job dep, use 'qrls' to release > --- > > Is there any way to solve this problem, other than setting the > keep_completed attribute to some non-zero value? The problem with > the keep_completed attribute is that we (think we) have to set it to > a big value, say, one day. When you set a job dependency TORQUE needs to know which job and under what conditions. If there is no record of a job TORQUE does not know what to do. Did the job finish correctly? did it fail? Why can you not submit the second job while the first job is still available? 
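For reference, a dependency of the kind discussed here is normally declared at submission time with qsub's -W depend=... option. The sketch below is only an illustration under stated assumptions: the helper depend_arg and the script name second.sh are hypothetical, and the job ID is simply the one from the error mail quoted above. It builds and prints the command line instead of submitting anything.

```shell
#!/bin/sh
# Build the value for qsub's -W option from a dependency type and job IDs.
# depend_arg is a hypothetical helper for this example, not part of TORQUE.
depend_arg() {
    dep_type="$1"; shift
    arg="depend=$dep_type"
    # Several job IDs are joined with ':' after the dependency type.
    for id in "$@"; do
        arg="$arg:$id"
    done
    printf '%s\n' "$arg"
}

# Dry run: print the submission command for a job that should start only
# after job 28379059 has finished without errors (afterok).
echo "qsub -W $(depend_arg afterok 28379059.lpbs2.decode.is) second.sh"
```

Replacing afterok with afterany in the call would make the second job eligible once the first has finished, regardless of how it ended.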
Ken Nielson Adaptive Computing From R.M.Krug at gmail.com Thu Sep 15 08:13:32 2011 From: R.M.Krug at gmail.com (Rainer M Krug) Date: Thu, 15 Sep 2011 16:13:32 +0200 Subject: [torqueusers] qpeek on array job Message-ID: Hi I am trying to use qpeek to monitor an array job, but I can't figure out how to do it. Could anybody advise? qsub -t 1-10 someScript.sub something like: 689903[].head started How can I use qpeek? The following does not work: rkrug at head002:~> qpeek 689903[1] qstat: Unknown Job Id 689903[1].head002.sun.ac.za Job 689903[1] is not running! rkrug at head002:~> qpeek 689903 qstat: Unknown Job Id 689903.head002.sun.ac.za Job 689903 is not running! rkrug at head002:~> qpeek 689903-1 qstat: illegally formed job identifier: 689903-1 Job 689903-1 is not running! rkrug at head002:~> qpeek 689903[] cat: /var/spool/torque/spool/689903[].head002.sun.ac.za.OU: No such file or directory rkrug at head002:~> Any ideas welcome, Rainer From basv at sara.nl Fri Sep 16 07:34:09 2011 From: basv at sara.nl (Bas van der Vlies) Date: Fri, 16 Sep 2011 15:34:09 +0200 Subject: [torqueusers] Success setting up a new Torque Environment in University In-Reply-To: References: Message-ID: <9980CD09-85BC-4967-876D-10F42B37FA20@sara.nl> On 16 sep 2011, at 13:34, Guilherme Rocha wrote: > > > > Hello folks, > > > > thanks a lot for all answers, I'm doing a troubleshooting now, considering your answers, in order to get my target acquired. > Thanks a lot once more. > > I already start to research how to use MPI, in fact we have the clustalw-mpi installed on the head node, > but no in the nodes (yet), I will try to do your recommendations now. > > We alson have open-mpi installed. > > > We are using Debian, all software installed via #aptitude. > > > I have this question from my teacher, the lab coordinator > Is possible to run X programs (like clustalx) interactivelly under Torque? 
>
qsub -I -X

-I: interactive
-X: X11 forwarding

You get a shell prompt:
 * xcalc

PS) man qsub

> > > > > 2011/9/2 Coyle, James J [ITACD] > You'll need and MPI application to use multiple nodes. > > > > Perhaps this application would be Clustalw-MPI . > > It looks like this is available at: > > http://www.bii.a-star.edu.sg/achievements/applications/clustalw/download.php > > Information on this can be found at: > > http://www.bii.a-star.edu.sg/docs/software/README.clustalw-mpi > > > > > > Torque just reserves node, you need something like MPi and program written in MPI to use > > multiple nodes. > > > > If you need a suggestion on MPI, I use OpenMPI because it installs easily and can use > > multiple network interconnects, it also works well with Torque. > > > > Ethernet works, but is slow, however, that is likely what you have. > > We use Infiniband and Myrinet networks in addition to Ethernet. They give much better > > performance for our workloads, but the cards and switches are very expensive. > > > > > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Guilherme Rocha > Sent: Friday, September 02, 2011 4:18 AM > To: torqueusers at supercluster.org > > > Subject: [torqueusers] Success setting up a new Torque Environment in University > > > > Hello folks, > > > > > My name is Guilherme and this is my first post here. > Thanks for this great project. > > We're setting-up a Torque Cluster with 23 nodes and will be used to bioinformatics tasks. > > I'm completely newbie to all of this, so after hard steps of troubleshooting, we finally > received good news in logs. We did a small alginment using clustalw. > > But we have some doubts about how to use a program in parallel, like: > > > Question 1) I need to have clustalw (or the script programs installed in all nodes?) > Clustalw is only installed in head node by now. > > Question 2: Can we use/open GUI program's interfaces to work using torque?
> > Question 3: When I submit a job, even requesting 10 nodes, clustal are being runned > only in one node. What can be wrong? > > > thanks in advance > > thanks in advance, > > > -- > -- > Guilherme Rocha > GF7 Doc & Systems - Solu??es Tecnol?gicas > Pesquisa e Desenvolvimento - World Wide > R. Jo?o Goulart, 170 - Rio Pardo - RS - CEP 96640-000 > Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > -- > Guilherme Rocha > GF7 Doc & Systems - Solu??es Tecnol?gicas > Pesquisa e Desenvolvimento - World Wide > R. Jo?o Goulart, 170 - Rio Pardo - RS - CEP 96640-000 > Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br > > -- Bas van der Vlies basv at sara.nl -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110916/7b490cee/attachment-0001.html From j.blank at fz-juelich.de Fri Sep 16 08:43:27 2011 From: j.blank at fz-juelich.de (Joerg Blank) Date: Fri, 16 Sep 2011 16:43:27 +0200 Subject: [torqueusers] Undead job on node Message-ID: Hello everyone, One of my colleagues dropped an array job with 1300 tasks on our Torque2.5.8/Maui cluster. That nearly halted the scheduler, but also created 2 nameless zombie jobs on 2 nodes. Those 2 jobs do not appear in qstat and Maui, but appear to block a processor, so every job scheduled on that slot gets deferred by Maui. See the jobs line in this pbsnodes output: & pbsnodes c-14 c-14 state = offline np = 8 properties = barcelona,bigmem ntype = cluster jobs = 6/ status = rectime=1316183711,varattr=,jobs=,state=free,netload=30221252289,gres=,loadave=0.27,ncpus=8,physmem=66180812kb,availmem=131971012kb,totmem=133289668kb,idletime=705610,nusers=0,nsessions=? 0,sessions=? 
0,uname=Linux c-14 2.6.32-5-amd64 #1 SMP Tue Jun 14 09:42:28 UTC 2011 x86_64,opsys=linux
gpus = 0

Regards,
Jörg Blank

From jjc at iastate.edu Fri Sep 16 09:21:07 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 16 Sep 2011 10:21:07 -0500
Subject: [torqueusers] Success setting up a new Torque Environment in University
In-Reply-To: <9980CD09-85BC-4967-876D-10F42B37FA20@sara.nl>
References: <9980CD09-85BC-4967-876D-10F42B37FA20@sara.nl>
Message-ID:

Guilherme Rocha,

To add to the options Bas van der Vlies sent to you, I suggest users also add the -V option to qsub. Make sure to add all the -l nodes=...,walltime=... options to the command line since you are not submitting a script. E.g.

qsub -V -I -X -l nodes=5:ppn=16,walltime=3:00:00

Also, clustalw and OpenMPI must be installed on all nodes for this to work. I use (and recommend) a common network mounted (e.g. NFS) /opt directory which is hosted from my head node so that I keep a consistent set of binaries and libraries on all nodes.

Why I did it this way:
------------------------
I did this because early on in cluster computing, I was a user on a cluster which had locally mounted non-system libraries. I had lots of jobs fail because one node in the set that my job ran on did not have the latest version of some libxxx.so library. This wasted lots of time for both me and the sysadmin, especially because I was running from several time zones away, making real-time responses nearly impossible, and because the jobs did not run immediately, so I would not know that the job failed until the next day.

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/ From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Bas van der Vlies Sent: Friday, September 16, 2011 8:34 AM To: Torque Users Mailing List Cc: rbzucoloto at gmail.com; vitorvilas at gmail.com Subject: Re: [torqueusers] Success setting up a new Torque Environment in University On 16 sep 2011, at 13:34, Guilherme Rocha wrote: Hello folks, thanks a lot for all answers, I'm doing a troubleshooting now, considering your answers, in order to get my target acquired. Thanks a lot once more. I already start to research how to use MPI, in fact we have the clustalw-mpi installed on the head node, but no in the nodes (yet), I will try to do your recommendations now. We alson have open-mpi installed. We are using Debian, all software installed via #aptitude. I have this question from my teacher, the lab coordinator * Is possible to run X programs (like clustalx) interactivelly under Torque? qsub - I -X -I: interactively -X: X11 forwarding You get a shell prompt: * xcalc PS) man qsub 2011/9/2 Coyle, James J [ITACD] > You'll need and MPI application to use multiple nodes. Perhaps this application would be Clustalw-MPI . It looks like this is available at: http://www.bii.a-star.edu.sg/achievements/applications/clustalw/download.php Information on this can be found at: http://www.bii.a-star.edu.sg/docs/software/README.clustalw-mpi Torque just reserves node, you need something like MPi and program written in MPI to use multiple nodes. If you need a suggestion on MPI, I use OpenMPI because it installs easily and can use multiple network interconnects, it also works well with Torque. Ethernet works, but is slow, however, that is likely what you have. We use Infiniband and Myrinet networks in addition to Ethernet. They give much better performance for our workloads, but the cards and switches are very expensive. 
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Guilherme Rocha Sent: Friday, September 02, 2011 4:18 AM To: torqueusers at supercluster.org Subject: [torqueusers] Success setting up a new Torque Environment in University Hello folks, My name is Guilherme and this is my first post here. Thanks for this great project. We're setting-up a Torque Cluster with 23 nodes and will be used to bioinformatics tasks. I'm completely newbie to all of this, so after hard steps of troubleshooting, we finally received good news in logs. We did a small alginment using clustalw. But we have some doubts about how to use a program in parallel, like: Question 1) I need to have clustalw (or the script programs installed in all nodes?) Clustalw is only installed in head node by now. Question 2: Can we use/open GUI program's interfaces to work using torque? Question 3: When I submit a job, even requesting 10 nodes, clustal are being runned only in one node. What can be wrong? thanks in advance thanks in advance, -- -- Guilherme Rocha GF7 Doc & Systems - Solu??es Tecnol?gicas Pesquisa e Desenvolvimento - World Wide R. Jo?o Goulart, 170 - Rio Pardo - RS - CEP 96640-000 Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- -- Guilherme Rocha GF7 Doc & Systems - Solu??es Tecnol?gicas Pesquisa e Desenvolvimento - World Wide R. Jo?o Goulart, 170 - Rio Pardo - RS - CEP 96640-000 Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br -- Bas van der Vlies basv at sara.nl -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110916/9e16fef7/attachment-0001.html

From decicco10 at gmail.com Fri Sep 16 09:29:12 2011
From: decicco10 at gmail.com (Marcelo De Cicco)
Date: Fri, 16 Sep 2011 12:29:12 -0300
Subject: [torqueusers] Nodes going crazy
Message-ID:

Hello!

A week ago we installed InfiniBand, and since then the nodes have been going crazy:

WARNING: active job '147' has inactive node n012 allocated for 1:20:52:21 (node state: 'Down')
WARNING: active job '142' has inactive node n012 allocated for 1:22:50:19 (node state: 'Down')
WARNING: active job '143' has inactive node n012 allocated for 1:22:48:38 (node state: 'Down')
WARNING: active job '144' has inactive node n012 allocated for 1:22:47:24 (node state: 'Down')
WARNING: active job '145' has inactive node n012 allocated for 1:22:45:41 (node state: 'Down')
WARNING: active job '146' has inactive node n012 allocated for 1:22:44:26 (node state: 'Down')
WARNING: active job '148' has inactive node n008 allocated for 1:03:19:34 (node state: 'Down')
WARNING: active job '150' has inactive node n008 allocated for 2:42:46 (node state: 'Down')

I restart pbs_mom on the nodes, but nothing happens. And then, suddenly, the nodes that were down come back up again!

Marcelo De Cicco

"Before printing, think of the environment and the costs"

"THE MORE PROGRESS PHYSICAL SCIENCES MAKE, THE MORE THEY TEND TO ENTER THE DOMAIN OF MATHEMATICS, WHICH IS A KIND OF CENTRE TO WHICH THEY ALL CONVERGE. WE MAY EVEN JUDGE THE DEGREE OF PERFECTION TO WHICH A SCIENCE HAS ARRIVED BY THE FACILITY WITH WHICH IT MAY BE SUBMITTED TO CALCULATION."
-- ADOLPHE QUETELET, 1796-1874

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110916/5bb9e1b7/attachment.html

From jjc at iastate.edu Fri Sep 16 09:32:18 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 16 Sep 2011 10:32:18 -0500
Subject: [torqueusers] qpeek on array job : Workarounds
In-Reply-To:
References:
Message-ID:

If you are the cluster admin, you could try adding the line

$spool_as_final_name true

to all your MOM configs and restarting. Then you can just monitor the output in the -o and -e output files.

You can always use qstat -n to see which node is the first node in the list that the job is running on, log in to that node, and look at the appropriate file in

/var/spool/torque/spool

It will be of the form jobname.ER or jobname.OU and should be owned by your username. This is all that qpeek does.

Once you figure out what is happening, you could look at qpeek and modify it to handle job arrays, and contribute the new script as a possible replacement.

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of Rainer M Krug
>Sent: Thursday, September 15, 2011 9:14 AM
>To: torqueusers at supercluster.org
>Subject: [torqueusers] qpeek on array job
>
>Hi
>
>I am trying to use qpeek to monitor an array job, but I can't figure
>out
>how to do it. Could anybody advise?
>
>qsub -t 1-10 someScript.sub
>
>something like:
>
>689903[].head started
>
>How can I use qpeek?
>
>The following does not work:
>
>rkrug at head002:~> qpeek 689903[1]
>qstat: Unknown Job Id 689903[1].head002.sun.ac.za
>Job 689903[1] is not running!
>rkrug at head002:~> qpeek 689903
>qstat: Unknown Job Id 689903.head002.sun.ac.za
>Job 689903 is not running!
>rkrug at head002:~> qpeek 689903-1
>qstat: illegally formed job identifier: 689903-1
>Job 689903-1 is not running!
>rkrug at head002:~> qpeek 689903[] >cat: /var/spool/torque/spool/689903[].head002.sun.ac.za.OU: No such >file >or directory >rkrug at head002:~> > >Any ideas welcome, > >Rainer > >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From glen.beane at gmail.com Fri Sep 16 10:05:25 2011 From: glen.beane at gmail.com (Glen Beane) Date: Fri, 16 Sep 2011 12:05:25 -0400 Subject: [torqueusers] qpeek on array job In-Reply-To: References: Message-ID: On Sep 15, 2011, at 10:13 AM, Rainer M Krug wrote: > Hi > > I am trying to use qpeek to monitor an array job, but I can't figure out > how to do it. Could anybody advise? > > qsub -t 1-10 someScript.sub > > something like: > > 689903[].head started > > How can I use qpeek? > > The following does not work: > > rkrug at head002:~> qpeek 689903[1] > qstat: Unknown Job Id 689903[1].head002.sun.ac.za I think you might need to pass -t to qstat to query jobs within the array > Job 689903[1] is not running! > rkrug at head002:~> qpeek 689903 > qstat: Unknown Job Id 689903.head002.sun.ac.za > Job 689903 is not running! > rkrug at head002:~> qpeek 689903-1 > qstat: illegally formed job identifier: 689903-1 > Job 689903-1 is not running! 
> rkrug at head002:~> qpeek 689903[] > cat: /var/spool/torque/spool/689903[].head002.sun.ac.za.OU: No such file > or directory > rkrug at head002:~> > > Any ideas welcome, > > Rainer > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Eirikur.Hjartarson at decode.is Fri Sep 16 10:08:01 2011 From: Eirikur.Hjartarson at decode.is (=?iso-8859-1?Q?Eir=EDkur_Hjartarson?=) Date: Fri, 16 Sep 2011 16:08:01 +0000 Subject: [torqueusers] Two problems with a routing queue In-Reply-To: References: <4C3D9F9382FE07458AF8E79B234B4BD8AD917E@smbx.decode.is> Message-ID: <4C3D9F9382FE07458AF8E79B234B4BD8AD9504@smbx.decode.is> Hi, > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers- > bounces at supercluster.org] On Behalf Of Ken Nielson > Sent: 16. september 2011 13:07 > To: Torque Users Mailing List > Subject: Re: [torqueusers] Two problems with a routing queue > > > > > Hi, > > > > I'm resubmitting these questions since I got no replies to them one > > week ago. 
> > > > In order to limit the number of jobs that maui considers for > > scheduling, we have a routing queue setup, > > > > # > > # Create and define queue exec > > # > > create queue exec > > set queue exec queue_type = Route > > set queue exec route_destinations = real_exec > > set queue exec route_held_jobs = False > > set queue exec enabled = True > > set queue exec started = True > > # > > # Create and define queue real_exec > > # > > create queue real_exec > > set queue real_exec queue_type = Execution > > set queue real_exec max_user_queuable = 800 > > set queue real_exec from_route_only = True > > set queue real_exec resources_default.nodes = 1 > > set queue real_exec enabled = True > > set queue real_exec started = True > > > > (800 is a bit higher than the number of CPUs in the cluster) > > > > There are two problems that we have experienced with this setup. > > > > 1. > > > > A job (id: 28379062), that is still on the "exec" queue and depends > > on another job (id: 28379059) that finishes *before* the job (id: > > 28379062) is put on the "real_exec" queue will generate the > > following error mail, when it (id: 28379062) is transferred to the > > "real_exec" queue. > > > > --- > > PBS Job Id: 28379062.lpbs2.decode.is > > Job Name: bambino_22892 > > Aborted by PBS Server > > Dependency request for job rejected by 28379059.lpbs2.decode.is > > Unknown Job Id Job held for unknown job dep, use 'qrls' to release > > --- > > > > Is there any way to solve this problem, other than setting the > > keep_completed attribute to some non-zero value? The problem with > > the keep_completed attribute is that we (think we) have to set it to > > a big value, say, one day. > > When you set a job dependency TORQUE needs to know which job and > under what conditions. If there is no record of a job TORQUE does not know > what to do. Did the job finish correctly? did it fail? > > Why can you not submit the second job while the first job is still available? 
Thanks for your response. I probably did a bad job of explaining the problem.

The jobs were submitted to the "exec" queue at the same time. Now the first job (28379059) is moved to the "real_exec" queue and finishes executing before the second job (28379062) is moved to the "real_exec" queue. At that time, when the second job is moved to the "real_exec" queue, the error mail is sent.

This problem is solvable by setting the "keep_completed" attribute for the "real_exec" queue to some non-zero value. In our case that may be several hours, and then e.g. the output from "qstat" is cluttered by information on completed jobs. That is why I am asking if there is some other solution.

The second problem I mentioned is more critical for us. It seems that jobs that are on system hold (because of dependencies) are transferred from the "exec" queue to the "real_exec" queue, regardless of the setting of the "route_held_jobs" attribute. On the other hand, jobs with user holds stay on the "exec" queue if the "route_held_jobs" attribute is set.

Regards,
--
Eiríkur Hjartarson
E-mail: Eirikur.Hjartarson at decode.is
Íslensk Erfðagreining
Mobile: +3546641898
Sturlugötu 7
IS-101 Reykjavík

From gus at ldeo.columbia.edu Fri Sep 16 10:28:21 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 16 Sep 2011 12:28:21 -0400
Subject: [torqueusers] Two problems with a routing queue
In-Reply-To:
References:
Message-ID: <4E737925.30300@ldeo.columbia.edu>

Ken Nielson wrote:
>
> ----- Original Message -----
>> From: "Eiríkur Hjartarson"
>> To: torqueusers at supercluster.org
>> Sent: Friday, September 16, 2011 3:57:13 AM
>> Subject: [torqueusers] Two problems with a routing queue
>>
>> Hi,
>>
>> I'm resubmitting these questions since I got no replies to them one
>> week ago.
>> >> In order to limit the number of jobs that maui considers for >> scheduling, we have a routing queue setup, >> >> # >> # Create and define queue exec >> # >> create queue exec >> set queue exec queue_type = Route >> set queue exec route_destinations = real_exec >> set queue exec route_held_jobs = False >> set queue exec enabled = True >> set queue exec started = True >> # >> # Create and define queue real_exec >> # >> create queue real_exec >> set queue real_exec queue_type = Execution >> set queue real_exec max_user_queuable = 800 >> set queue real_exec from_route_only = True >> set queue real_exec resources_default.nodes = 1 >> set queue real_exec enabled = True >> set queue real_exec started = True >> >> (800 is a bit higher than the number of CPUs in the cluster) >> >> There are two problems that we have experienced with this setup. >> >> 1. >> >> A job (id: 28379062), that is still on the "exec" queue and depends >> on another job (id: 28379059) that finishes *before* the job (id: >> 28379062) is put on the "real_exec" queue will generate the >> following error mail, when it (id: 28379062) is transferred to the >> "real_exec" queue. >> >> --- >> PBS Job Id: 28379062.lpbs2.decode.is >> Job Name: bambino_22892 >> Aborted by PBS Server >> Dependency request for job rejected by 28379059.lpbs2.decode.is >> Unknown Job Id Job held for unknown job dep, use 'qrls' to release >> --- >> >> Is there any way to solve this problem, other than setting the >> keep_completed attribute to some non-zero value? The problem with >> the keep_completed attribute is that we (think we) have to set it to >> a big value, say, one day. > > When you set a job dependency TORQUE needs to know which job and under what conditions. > If there is no record of a job TORQUE does not know what to do. > Did the job finish correctly? did it fail? > > Why can you not submit the second job while the first job is still available? 
>
> Ken Nielson
> Adaptive Computing
>

Hi Eirikur and Ken

I had a similar problem some time ago, and I found it useful to extend the time that completed jobs are kept on the queue. Note that the unit used is seconds. If you don't have a high volume of jobs this is not a problem.

qmgr -c 'set server keep_completed = the number of seconds you want'

Also, and this may be a question for Ken as well: what makes 'afterok' true? Is it an empty stderr? Something else? Oftentimes programs dump warning messages [not errors] to stderr; the job ends 'OK' but stderr is not empty. I prefer to use 'afterany' because of this doubt.

Thank you,
Gus Correa

From gus at ldeo.columbia.edu Fri Sep 16 10:48:00 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 16 Sep 2011 12:48:00 -0400
Subject: [torqueusers] Success setting up a new Torque Environment in University
In-Reply-To: <9980CD09-85BC-4967-876D-10F42B37FA20@sara.nl>
References: <9980CD09-85BC-4967-876D-10F42B37FA20@sara.nl>
Message-ID: <4E737DC0.2000307@ldeo.columbia.edu>

Bas van der Vlies wrote:
>
> On 16 sep 2011, at 13:34, Guilherme Rocha wrote:
>
>> Hello folks,
>>
>> thanks a lot for all answers, I'm doing a troubleshooting now,
>> considering your answers, in order to get my target acquired.
>> Thanks a lot once more.
>>
>> I already start to research how to use MPI, in fact we have the
>> clustalw-mpi installed on the head node,
>> but no in the nodes (yet), I will try to do your recommendations now.
>>
>> We alson have open-mpi installed.
>>
>> We are using Debian, all software installed via #aptitude.
>>
>> I have this question from my teacher, the lab coordinator
>>
>> * Is possible to run X programs (like clustalx) interactivelly
>> under Torque?
>>
> qsub -I -X
>
> -I: interactively
> -X: X11 forwarding
>
> You get a shell prompt:
> * xcalc
>
> PS) man qsub
>

Hi Guilherme and Bas

Guilherme: Here we use the 'interactive+X_enabled' feature of Torque as Bas explained, and it works fine.
However, if your compute nodes are only on a private subnet [as in most clusters], perhaps doing NAT to the outside world via the head node, a user running an interactive+X_enabled job will probably have to hop from his local machine to the head node (via ssh -X) and launch the job from there. This leads to increased network traffic across the head node, particularly if the program is heavy on graphics.

Here this type of job doesn't happen very often [to my relief]. Nevertheless, such jobs typically use Matlab, IDL, and similar multi-windowed interactive tools that can generate quite a lot of graphics output and heavy network traffic. I don't know anything about clustalw, but I would guess it has some of these characteristics.

In any case, I presume this is something a system administrator has to be aware of, perhaps restricting the number of simultaneous interactive+X_enabled jobs, say, via a specific queue, to avoid a bottleneck.

Other people on the list with more experience with interactive+X_enabled jobs could perhaps post their views.
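The idea of a dedicated queue for such jobs could be sketched roughly as follows. This is only an illustration, not a tested site configuration: the queue name xinteractive and the limits (4 running jobs in total, 1 per user) are assumptions picked for the example, and the helper function is hypothetical. It only prints qmgr directives, so they can be reviewed before a Torque manager pipes them into qmgr.

```shell
#!/bin/sh
# Print qmgr directives for a small execution queue meant for
# interactive X jobs. The queue name and limits are example values.
interactive_queue_directives() {
    queue="$1"        # hypothetical queue name, e.g. xinteractive
    max_running="$2"  # total jobs allowed to run in this queue at once
    cat <<EOF
create queue $queue
set queue $queue queue_type = Execution
set queue $queue max_running = $max_running
set queue $queue max_user_run = 1
set queue $queue enabled = True
set queue $queue started = True
EOF
}

# Dry run: show the directives instead of applying them.
interactive_queue_directives xinteractive 4

# To apply them (as a Torque manager), something like:
#   interactive_queue_directives xinteractive 4 | qmgr
```

A user would then start a session from the head node with something like qsub -I -X -q xinteractive.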
Thank you, Gus Correa >> >> >> >> >> 2011/9/2 Coyle, James J [ITACD] > >> >> You?ll need and MPI application to use multiple nodes.____ >> >> __ __ >> >> Perhaps this application would be Clustalw-MPI .____ >> >> It looks like this is available at:____ >> >> http://www.bii.a-star.edu.sg/achievements/applications/clustalw/download.php____ >> >> Information on this can be found at:____ >> >> http://www.bii.a-star.edu.sg/docs/software/README.clustalw-mpi____ >> >> __ __ >> >> __ __ >> >> Torque just reserves node, you need something like MPi and >> program written in MPI to use____ >> >> multiple nodes.____ >> >> __ __ >> >> If you need a suggestion on MPI, I use OpenMPI because it >> installs easily and can use ____ >> >> multiple network interconnects, it also works well with Torque.____ >> >> __ __ >> >> Ethernet works, but is slow, however, that is likely what you >> have.____ >> >> We use Infiniband and Myrinet networks in addition to Ethernet. >> They give much better ____ >> >> performance for our workloads, but the cards and switches are very >> expensive.____ >> >> __ __ >> >> __ __ >> >> *From:* torqueusers-bounces at supercluster.org >> >> [mailto:torqueusers-bounces at supercluster.org >> ] *On Behalf Of >> *Guilherme Rocha >> *Sent:* Friday, September 02, 2011 4:18 AM >> *To:* torqueusers at supercluster.org >> >> >> >> *Subject:* [torqueusers] Success setting up a new Torque >> Environment in University____ >> >> __ __ >> >> Hello folks, >> >> >> >> >> My name is Guilherme and this is my first post here. >> Thanks for this great project. >> >> We're setting-up a Torque Cluster with 23 nodes and will be used >> to bioinformatics tasks. >> >> I'm completely newbie to all of this, so after hard steps of >> troubleshooting, we finally >> received good news in logs. We did a small alginment using clustalw. 
>> >> But we have some doubts about how to use a program in parallel, like: >> >> >> Question 1) I need to have clustalw (or the script programs >> installed in all nodes?) >> Clustalw is only installed in head node by now. >> >> Question 2: Can we use/open GUI program's interfaces to work >> using torque? >> >> Question 3: When I submit a job, even requesting 10 nodes, clustal >> are being runned >> only in one node. What can be wrong? >> >> >> thanks in advance >> >> thanks in advance, >> >> >> -- >> -- >> Guilherme Rocha >> GF7 Doc & Systems - Solu??es Tecnol?gicas >> Pesquisa e Desenvolvimento - World Wide >> R. Jo?o Goulart, 170 - Rio Pardo - RS - CEP 96640-000 >> Mobile: +55 51 81400360 - Home >> Page: http://www.gf7.com.br ____ >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> >> -- >> -- >> Guilherme Rocha >> GF7 Doc & Systems - Solu??es Tecnol?gicas >> Pesquisa e Desenvolvimento - World Wide >> R. Jo?o Goulart, 170 - Rio Pardo - RS - CEP 96640-000 >> Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br >> >> >> > > -- > Bas van der Vlies > basv at sara.nl > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From tbaer at utk.edu Fri Sep 16 12:25:47 2011 From: tbaer at utk.edu (Troy Baer) Date: Fri, 16 Sep 2011 14:25:47 -0400 Subject: [torqueusers] qpeek on array job In-Reply-To: References: Message-ID: <1316197547.27031.4.camel@browncoat.jics.utk.edu> On Thu, 2011-09-15 at 16:13 +0200, Rainer M Krug wrote: > I am trying to use qpeek to monitor an array job, but I can't figure out > how to do it. Could anybody advise? 
> > qsub -t 1-10 someScript.sub > > something like: > > 689903[].head started > > How can I use qpeek? > > The following does not work: > > rkrug at head002:~> qpeek 689903[1] > qstat: Unknown Job Id 689903[1].head002.sun.ac.za > Job 689903[1] is not running! > rkrug at head002:~> qpeek 689903 > qstat: Unknown Job Id 689903.head002.sun.ac.za > Job 689903 is not running! > rkrug at head002:~> qpeek 689903-1 > qstat: illegally formed job identifier: 689903-1 > Job 689903-1 is not running! > rkrug at head002:~> qpeek 689903[] > cat: /var/spool/torque/spool/689903[].head002.sun.ac.za.OU: No such file > or directory > rkrug at head002:~> Try 689903-1 rather than 689903[1], as the former is how the job is actually named from TORQUE's PoV. BTW, qpeek predates TORQUE's support for job arrays, so it really doesn't know anything about them. If you come up with a patch to make this work, please send it to me. --Troy -- Troy Baer, HPC System Administrator National Institute for Computational Sciences, University of Tennessee http://www.nics.tennessee.edu/ Phone: 865-241-4233 From dbeer at adaptivecomputing.com Fri Sep 16 13:32:31 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 16 Sep 2011 13:32:31 -0600 (MDT) Subject: [torqueusers] qpeek on array job In-Reply-To: <1316197547.27031.4.camel@browncoat.jics.utk.edu> Message-ID: <14dff659-c8b9-47f0-9229-24cb3990726c@mail> ----- Original Message ----- > On Thu, 2011-09-15 at 16:13 +0200, Rainer M Krug wrote: > > I am trying to use qpeek to monitor an array job, but I can't > > figure out > > how to do it. Could anybody advise? > > > > qsub -t 1-10 someScript.sub > > > > something like: > > > > 689903[].head started > > > > How can I use qpeek? > > > > The following does not work: > > > > rkrug at head002:~> qpeek 689903[1] > > qstat: Unknown Job Id 689903[1].head002.sun.ac.za > > Job 689903[1] is not running! 
> > rkrug at head002:~> qpeek 689903 > > qstat: Unknown Job Id 689903.head002.sun.ac.za > > Job 689903 is not running! > > rkrug at head002:~> qpeek 689903-1 > > qstat: illegally formed job identifier: 689903-1 > > Job 689903-1 is not running! > > rkrug at head002:~> qpeek 689903[] > > cat: /var/spool/torque/spool/689903[].head002.sun.ac.za.OU: No such > > file > > or directory > > rkrug at head002:~> > > Try 689903-1 rather than 689903[1], as the former is how the job is > actually named from TORQUE's PoV. > > BTW, qpeek predates TORQUE's support for job arrays, so it really > doesn't know anything about them. If you come up with a patch to > make > this work, please send it to me. > I second that for qpeek - send it to me and I'll fix it in TORQUE. -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1656 S. East Bay Blvd. Suite #300 Provo, UT 84606 From j.blank at fz-juelich.de Sat Sep 17 05:33:11 2011 From: j.blank at fz-juelich.de (Joerg Blank) Date: Sat, 17 Sep 2011 13:33:11 +0200 Subject: [torqueusers] Undead job on node In-Reply-To: References: Message-ID: Hello everyone, a little update from my zombie job: Seems like the pointer to the jobname is dangling: pbsnodes c-2 c-2 state = offline np = 24 properties = magnycours,smallmem ntype = cluster jobs = 12/arch_temp/retrieval_3d_40 status = rectime=1316258984,varattr=,jobs=,state=free,netload=38066458034,gres=,loadave=0.24,ncpus=24,physmem=66110892kb,availmem=131807188kb,totmem=133219748kb,idletime=569301,nusers=1,nsessions=1,sessions=23460,uname=Linux c-2 2.6.32-5-amd64 #1 SMP Tue Jun 14 09:42:28 UTC 2011 x86_64,opsys=linux gpus = 0 It also sometimes points to environment variables of random jobs. 
Regards, Jörg Blank From pablo.fernandez at cscs.ch Wed Sep 21 08:23:30 2011 From: pablo.fernandez at cscs.ch (Pablo Fernandez) Date: Wed, 21 Sep 2011 16:23:30 +0200 Subject: [torqueusers] Missing E entries Message-ID: <201109211623.30744.pablo.fernandez@cscs.ch> Dear all, I have just realized that in our server_priv/accounting files there are Exit entries (those with ;E;) missing. This seems to happen when you have Delete entries only (those with ;D;), but not the other way around. I mean: - All missing ;E; entries have a corresponding ;D; entry - Not all ;D; entries have a corresponding ;E; entry. I thought that, whenever there is a ;D; entry, there should also be a ;E; entry... but apparently that's not always the case. This is indeed quite bad for us, because most accounting is done by parsing ;E; entries. Does any of you know why this could be? We're running 2.4.16, but this also happened with at least 2.4.13. Thanks! Pablo From samuel at unimelb.edu.au Mon Sep 26 05:30:06 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 26 Sep 2011 21:30:06 +1000 Subject: [torqueusers] Missing E entries In-Reply-To: <201109211623.30744.pablo.fernandez@cscs.ch> References: <201109211623.30744.pablo.fernandez@cscs.ch> Message-ID: <4E80623E.2090201@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 22/09/11 00:23, Pablo Fernandez wrote: > I thought that, whenever there is a ;D; entry, there should also be a > ;E; entry... but apparently that's not always the case. This is indeed > quite bad for us, because most accounting is done by parsing ;E; entries. > > > Does any of you know why this could be? We're running 2.4.16, but this > also happened with at least 2.4.13. I'm pretty sure we're seeing the same issue here too, with the same version of Torque.
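For anyone who wants to check their own site for this, here is a rough sketch that lists job IDs having a ;D; record but no ;E; record. It assumes the usual accounting layout of "date time;type;jobid;message" with semicolon-separated fields; the function name missing_exit is just something made up for this example, so treat it as a starting point rather than a verified tool.

```shell
#!/bin/sh
# List job IDs that have a ;D; (delete) record but no ;E; (exit)
# record in the given accounting files.
# Assumed record layout: "date time;type;jobid;message".
missing_exit() {
    awk -F';' '
        $2 == "D" { deleted[$3] = 1 }   # job was deleted
        $2 == "E" { exited[$3] = 1 }    # job has an exit record
        END {
            for (id in deleted)
                if (!(id in exited))
                    print id
        }
    ' "$@" | sort
}

# Example call (path as in a default install, adjust for your site):
#   missing_exit /var/spool/torque/server_priv/accounting/201109*
```

Passing several daily files at once matters, since a job deleted near midnight could have its ;D; and ;E; records in different files.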
cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6AYj4ACgkQO2KABBYQAh/IoACfU9FuD+pK3vhcJfQZh12Nd3e2 TokAnjizNf+DW+KjaHp9u/nW53fAZFoq =wRFd -----END PGP SIGNATURE----- From Ole.H.Nielsen at fysik.dtu.dk Tue Sep 27 06:33:26 2011 From: Ole.H.Nielsen at fysik.dtu.dk (Ole Holm Nielsen) Date: Tue, 27 Sep 2011 14:33:26 +0200 Subject: [torqueusers] ANNOUNCE: pestat v.2.11: Print a 1-line summary of jobs on each node Message-ID: <4E81C296.6090004@fysik.dtu.dk> Dear Torque users, There is an updated pestat version 2.11 available from ftp://ftp.fysik.dtu.dk/pub/Torque/pestat. New features are: 1. The Torque pbs_mom records network load information as the sum of transmit+receive of all interfaces. The "netload" information is defined in the source file ./src/resmom/linux/mom_mach.c as the sum of bytes on all network interfaces since boot time, read from /proc/net/dev. The pestat command (from version 2.9) prints delta-netload information when run twice with some time interval in between. The file $NETLOADFILE stores recorded information. The baseline netload information may be generated from cron, say, every 10 minutes by this crontab entry: */10 * * * * /usr/local/bin/pestat -C > /dev/null If the netload exceeds NETLOADTHRES (2000 Mbit/sec full-duplex), this node will be flagged. Please change NETLOADTHRES if you want to flag lower netloads. If your nodes use Ethernet port bonding, please configure the NETLOADSCALE variable in the script. 2. The "-j jobs" flag lists only those nodes that run at least "jobs" user jobs. If your site policy permits multiple jobs per node, you can use this flag to check specifically any multi-job nodes. 
General info: The pestat utility is used for printing a 1-line summary of jobs on each node. It parses the output of "pbsnodes -a" and presents the output in a compact, useful format. In particular we use pestat all the time to display only those nodes which have jobs that behave in an unexpected way, for example: # pestat -f Listing only nodes that are flagged by * node state load pmem ncpu mem resi usrs tasks jobids/users n031 excl 0* 7990 4 23992 1380 1/1 4 381711 user1 n040 excl 0* 7990 4 23992 1061 1/1 4 381620 user1 n045 free 0.68* 7990 4 23992 139 0/0 0 n046 free 0.69* 7990 4 23992 140 0/0 0 p013 excl 1* 24110 4 56110 296 1/1 4 400491 user2 p014 excl 1* 24110 4 56110 16036 1/1 4 400491 user2 a063 excl 9.5* 24098 8 72097 1370 1/1 8 400325 user3 a126 excl 8.7* 24098 8 72097 7110 1/1 8 400260 user5 b003 excl 8.5* 24098 8 72097 985 1/1 8 400333 user3 b074 excl 8.6* 24098 8 72097 17123 1/1 8 399435 user4 b109 excl 8.6* 24098 8 72097 1062 1/1 8 400334 user3 c103 excl 8.6* 24098 8 72097 17080 1/1 8 399437 user4 c140 busy* 8 24098 8 72097 20130 1/1 8 393075 user7 d015 excl 5* 24098 8 72097 7235 1/1 8 400453 user6 d034 excl 5* 24098 8 72097 7213 1/1 8 400453 user6 d040 excl 8.5* 24098 8 72097 1177 1/1 8 400350 user3 d050 excl 8.7* 24098 8 72097 17197 1/1 8 399438 user4 Usage: /usr/local/bin/pestat [-f] [-c|-n] [-d] [-V] [-u username|-g groupname] [-j jobs] [-C] [-h] where: -f: Listing only nodes that are flagged by \* -d: Listing also nodes that are down -c/-n: Color/no color output -u username: Print only user (do not use with the -g flag) -g groupname: Print only users in group -j jobs: List only nodes with at least running jobs -C: Use with cron: Netload file will be saved as /tmp/netload.cron -h: Print this help information -V: Version information -- Ole Holm Nielsen Department of Physics, Technical University of Denmark From sm4082 at nyu.edu Tue Sep 27 09:32:54 2011 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Tue, 27 Sep 2011 11:32:54 -0400 Subject: 
[torqueusers] Job arrays don't show up in pbstop output Message-ID: <601D8486-C76B-44DF-8959-E23AFEB3F2BF@nyu.edu> Hi, Since we upgraded to torque 2.5.8, job arrays are not showing up in pbstop output. I don't have enough expertise in perl to change the pbstop code. If anyone has fixed this problem or could give me pointers on fixing it, I would greatly appreciate it. I have looked for the latest version of pbstop, but it looks like I already have the latest one (pbstop-4.16-10.el5 and perl-PBS-0.33-10.el5). Thanks in advance, Sreedhar. From atp42 at cornell.edu Mon Sep 26 12:35:36 2011 From: atp42 at cornell.edu (Aaron T Perry) Date: Mon, 26 Sep 2011 14:35:36 -0400 Subject: [torqueusers] Help: Unauthorized Request Message-ID: Hi, I've just tried to install torque, and I ran the following commands, ./configure sudo make sudo make install however when I run ./torque.setup username I get the following... initializing TORQUE (admin: username at ubuntu) PBS_Server ubuntu: Create mode and server database exists, do you wish to continue y/(n)?y Max open servers: 9 qmgr obj= svr=default: Unauthorized Request Max open servers: 9 qmgr obj= svr=default: Unauthorized Request qmgr obj= svr=default: Unauthorized Request qmgr obj= svr=default: Unauthorized Request qmgr obj= svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj= svr=default: Unauthorized Request The server launched and I cannot stop it, nor can I issue any command related to torque (qterm, qmgr, qsub, etc) under my current username or under root. Help! Thank you, -Aaron -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110926/2bcb149c/attachment.html From vt7500 at yahoo.com Mon Sep 26 13:20:29 2011 From: vt7500 at yahoo.com (Vt Vt) Date: Mon, 26 Sep 2011 12:20:29 -0700 (PDT) Subject: [torqueusers] Unauthorized Request Message-ID: <1317064829.73299.YahooMailNeo@web125119.mail.ne1.yahoo.com> Hi, I have been baffled by the error "Unauthorized Request" that I keep getting while installing torque. I tried several versions including 3.0.2 and some older versions. System: Ubuntu 11.04 Natty Narwhal machine type: it's a single cpu (12 core machine) I am trying to use just 8 cores for the setup. My questions are: (1) how to get rid of this error? server logs: PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost # qmgr -c 'p s ' # # Set server attributes. # set server acl_hosts = XXXX set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 # create queue batch qmgr obj=batch svr=default: Unauthorized Request Qmgr: list server managers Server XXXX.XX.XXXX.XXX (no output) Qmgr: set server managers+=root at XXXX qmgr obj= svr=default: Unauthorized Request trying qterm gives the same result. (2) how do I add a host to server_manager? (3) How do I completely uninstall torque when I installed torque from a tarball using default parameters? (4) has anybody got detailed info on an ubuntu 11.04 + torque install? From s.prabhakaran at grs-sim.de Tue Sep 27 06:43:27 2011 From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran) Date: Tue, 27 Sep 2011 14:43:27 +0200 Subject: [torqueusers] Documentation available? In-Reply-To: <4E81C296.6090004@fysik.dtu.dk> References: <4E81C296.6090004@fysik.dtu.dk> Message-ID: Hello all, Is there any design documentation of PBS/Torque (other than the admin guide) available somewhere? If so, could anyone please point me to that?
I have been able to find the old PBS (2.2) external reference specification, internal design specification, and requirements specification. Haven't been able to find the external design specification. Is there any documentation available that delves into design? Thank you, Suraj From atp42 at cornell.edu Tue Sep 27 07:53:28 2011 From: atp42 at cornell.edu (Aaron T Perry) Date: Tue, 27 Sep 2011 09:53:28 -0400 Subject: [torqueusers] Help: Unauthorized Request In-Reply-To: References: Message-ID: Please, any help you can give would be greatly appreciated, I'm completely stuck. All the solutions I found online have failed. On Mon, Sep 26, 2011 at 2:35 PM, Aaron wrote: > Hi, > > I've just tried to install torque, and I ran the following commands, > > ./configure > sudo make > sudo make install > > however when I run ./torque.setup username I get the following... > > initializing TORQUE (admin: username at ubuntu) > PBS_Server ubuntu: Create mode and server database exists, > do you wish to continue y/(n)?y > Max open servers: 9 > qmgr obj= svr=default: Unauthorized Request > Max open servers: 9 > qmgr obj= svr=default: Unauthorized Request > qmgr obj= svr=default: Unauthorized Request > qmgr obj= svr=default: Unauthorized Request > qmgr obj= svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj= svr=default: Unauthorized Request > > The server lanched and I cannot stop it, nor can issue any command related > to torque (qterm, gmgr, qsub, etc) under my current username or under root. > Help! > > Thank you, > -Aaron > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110927/b6f8cac1/attachment.html From dave.zarnoch at sykes.com Tue Sep 27 10:00:14 2011 From: dave.zarnoch at sykes.com (Zarnoch, Dave) Date: Tue, 27 Sep 2011 12:00:14 -0400 Subject: [torqueusers] Documentation available? In-Reply-To: References: <4E81C296.6090004@fysik.dtu.dk> Message-ID: <7651D43C8FD38F458F022D03E5AF91E501050C65@ustpamxc005.amer.sykes.com> Try: http://www.pbsworks.com/SupportDocuments.aspx Dave Dave Zarnoch UNIX Systems Administration (215)200-0911 Dave.Zarnoch at sykes.com -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Suraj Prabhakaran Sent: Tuesday, September 27, 2011 8:43 AM To: Torque Users Mailing List Subject: [torqueusers] Documentation available? Hello all, Are there any design documentation of PBS/Torque (other than the admin guide) available somewhere? If so, could anyone please point me to that? I have been able to find the old PBS (2.2) external reference specification, internal design specification, and requirements specification. Haven't been able to find the external design specification. Is there any documentation available that delves into design? Thank you, Suraj _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From jdsmit at sandia.gov Tue Sep 27 10:13:38 2011 From: jdsmit at sandia.gov (Smith, Jerry Don II) Date: Tue, 27 Sep 2011 16:13:38 +0000 Subject: [torqueusers] Help: Unauthorized Request In-Reply-To: Message-ID: Are you seeing anything in the pbs_server logs? -Jerry From: Aaron T Perry > Reply-To: Torque Users Mailing List > Date: Tue, 27 Sep 2011 09:53:28 -0400 To: > Subject: Re: [torqueusers] Help: Unauthorized Request Please, any help you can give would be greatly appreciated, I'm completely stuck. All the solutions I found online have failed. 
On Mon, Sep 26, 2011 at 2:35 PM, Aaron > wrote: Hi, I've just tried to install torque, and I ran the following commands, ./configure sudo make sudo make install however when I run ./torque.setup username I get the following... initializing TORQUE (admin: username at ubuntu) PBS_Server ubuntu: Create mode and server database exists, do you wish to continue y/(n)?y Max open servers: 9 qmgr obj= svr=default: Unauthorized Request Max open servers: 9 qmgr obj= svr=default: Unauthorized Request qmgr obj= svr=default: Unauthorized Request qmgr obj= svr=default: Unauthorized Request qmgr obj= svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj=batch svr=default: Unauthorized Request qmgr obj= svr=default: Unauthorized Request The server lanched and I cannot stop it, nor can issue any command related to torque (qterm, gmgr, qsub, etc) under my current username or under root. Help! Thank you, -Aaron _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110927/d9cae414/attachment-0001.html From knielson at adaptivecomputing.com Tue Sep 27 10:26:39 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 27 Sep 2011 10:26:39 -0600 (MDT) Subject: [torqueusers] Documentation available? 
In-Reply-To: <7651D43C8FD38F458F022D03E5AF91E501050C65@ustpamxc005.amer.sykes.com> Message-ID: <4a72fb09-7872-4e5c-8cb0-791b9b2416ba@mail> ----- Original Message ----- > From: "Dave Zarnoch" > To: "Torque Users Mailing List" > Sent: Tuesday, September 27, 2011 10:00:14 AM > Subject: Re: [torqueusers] Documentation available? > > Try: > > http://www.pbsworks.com/SupportDocuments.aspx > > Dave > > Dave Zarnoch > UNIX Systems Administration > (215)200-0911 > Dave.Zarnoch at sykes.com > > > -----Original Message----- > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Suraj > Prabhakaran > Sent: Tuesday, September 27, 2011 8:43 AM > To: Torque Users Mailing List > Subject: [torqueusers] Documentation available? > > Hello all, > > Are there any design documentation of PBS/Torque (other than the > admin > guide) available somewhere? If so, could anyone please point me to > that? > I have been able to find the old PBS (2.2) external reference > specification, internal design specification, and requirements > specification. Haven't been able to find the external design > specification. Is there any documentation available that delves into > design? > > Thank you, > Suraj You can also try http://www.adaptivecomputing.com/resources/docs/torque/index.php Ken From atp42 at cornell.edu Tue Sep 27 10:33:31 2011 From: atp42 at cornell.edu (Aaron T Perry) Date: Tue, 27 Sep 2011 12:33:31 -0400 Subject: [torqueusers] Help: Unauthorized Request In-Reply-To: References: Message-ID: With the exception of the Unauthorized Request entries, it looks like almost everything is okay, except for the nodes file and root at localhost (this should be root at ubuntu). Thank you for your help! Aaron Here is an excerpt from the server log...
09/27/2011 09:51:31;0002;PBS_Server;Svr;Log;Log opened 09/27/2011 09:51:31;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4 09/27/2011 09:51:42;0002;PBS_Server;Svr;Log;Log opened 09/27/2011 09:51:42;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4 09/27/2011 09:51:44;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20110927 opened 09/27/2011 09:51:44;0040;PBS_Server;Req;setup_nodes;setup_nodes() 09/27/2011 09:51:44;0004;PBS_Server;Svr;ubuntu;cannot open node description file '/var/spool/torque/server_priv/nodes' in setup_nodes() 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs 09/27/2011 09:51:44;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'ubuntu') 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: child process in background 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 11995, loglevel=0 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 
09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost 09/27/2011 09:51:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0 09/27/2011 09:56:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0 ... On Tue, Sep 27, 2011 at 12:13 PM, Smith, Jerry Don II wrote: > Are you seeing anything in the pbs_server logs? > > -Jerry > > From: Aaron T Perry > Reply-To: Torque Users Mailing List > Date: Tue, 27 Sep 2011 09:53:28 -0400 > To: > Subject: Re: [torqueusers] Help: Unauthorized Request > > Please, any help you can give would be greatly appreciated, I'm completely > stuck. All the solutions I found online have failed. > > On Mon, Sep 26, 2011 at 2:35 PM, Aaron wrote: > >> Hi, >> >> I've just tried to install torque, and I ran the following commands, >> >> ./configure >> sudo make >> sudo make install >> >> however when I run ./torque.setup username I get the following... 
>> >> initializing TORQUE (admin: username at ubuntu) >> PBS_Server ubuntu: Create mode and server database exists, >> do you wish to continue y/(n)?y >> Max open servers: 9 >> qmgr obj= svr=default: Unauthorized Request >> Max open servers: 9 >> qmgr obj= svr=default: Unauthorized Request >> qmgr obj= svr=default: Unauthorized Request >> qmgr obj= svr=default: Unauthorized Request >> qmgr obj= svr=default: Unauthorized Request >> qmgr obj=batch svr=default: Unauthorized Request >> qmgr obj=batch svr=default: Unauthorized Request >> qmgr obj=batch svr=default: Unauthorized Request >> qmgr obj=batch svr=default: Unauthorized Request >> qmgr obj=batch svr=default: Unauthorized Request >> qmgr obj=batch svr=default: Unauthorized Request >> qmgr obj= svr=default: Unauthorized Request >> >> The server lanched and I cannot stop it, nor can issue any command >> related to torque (qterm, gmgr, qsub, etc) under my current username or >> under root. Help! >> >> Thank you, >> -Aaron >> >> > _______________________________________________ torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110927/c68adaf4/attachment.html From jdsmit at sandia.gov Tue Sep 27 10:40:21 2011 From: jdsmit at sandia.gov (Smith, Jerry Don II) Date: Tue, 27 Sep 2011 16:40:21 +0000 Subject: [torqueusers] Help: Unauthorized Request In-Reply-To: Message-ID: Have you set up hosts.equiv? see: http://www.clusterresources.com/torquedocs/1.3advconfig.shtml 1.3.2.1 Server Configuration Overview There are several steps to ensure that the server and the nodes are completely aware of each other and able to communicate directly. 
Some of this configuration takes place within TORQUE directly using the qmgr command. Other configuration settings are managed using the pbs_server nodes file, DNS files such as /etc/hosts and the /etc/hosts.equiv file.

1.3.2.2 Name Service Configuration

Each node, as well as the server, must be able to resolve the name of every node with which it will interact. This can be accomplished using /etc/hosts, DNS, NIS, or other mechanisms. In the case of /etc/hosts, the file can be shared across systems in most cases.

-Jerry

From: Aaron T Perry
Reply-To: Torque Users Mailing List
Date: Tue, 27 Sep 2011 12:33:31 -0400
To: Torque Users Mailing List
Subject: Re: [torqueusers] Help: Unauthorized Request

With the exception of the unauthorized request entries it looks like almost everything is okay, except for the node file and root localhost (this should be root ubuntu).

Thank you for your help!
Aaron

Here is an excerpt from the server log...

09/27/2011 09:51:31;0002;PBS_Server;Svr;Log;Log opened
09/27/2011 09:51:31;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4
09/27/2011 09:51:42;0002;PBS_Server;Svr;Log;Log opened
09/27/2011 09:51:42;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4
09/27/2011 09:51:44;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20110927 opened
09/27/2011 09:51:44;0040;PBS_Server;Req;setup_nodes;setup_nodes()
09/27/2011 09:51:44;0004;PBS_Server;Svr;ubuntu;cannot open node description file '/var/spool/torque/server_priv/nodes' in setup_nodes()
09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues
09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
09/27/2011 09:51:44;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'ubuntu')
09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting
09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting
09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: child process in background
09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 11995, loglevel=0
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
09/27/2011 09:56:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
...

On Tue, Sep 27, 2011 at 12:13 PM, Smith, Jerry Don II wrote:

Are you seeing anything in the pbs_server logs?

-Jerry

From: Aaron T Perry
Reply-To: Torque Users Mailing List
Date: Tue, 27 Sep 2011 09:53:28 -0400
Subject: Re: [torqueusers] Help: Unauthorized Request

Please, any help you can give would be greatly appreciated, I'm completely stuck. All the solutions I found online have failed.

On Mon, Sep 26, 2011 at 2:35 PM, Aaron wrote:

Hi,

I've just tried to install torque, and I ran the following commands,

./configure
sudo make
sudo make install

however when I run ./torque.setup username I get the following...

initializing TORQUE (admin: username@ubuntu)
PBS_Server ubuntu: Create mode and server database exists,
do you wish to continue y/(n)?y
Max open servers: 9
qmgr obj= svr=default: Unauthorized Request
Max open servers: 9
qmgr obj= svr=default: Unauthorized Request
qmgr obj= svr=default: Unauthorized Request
qmgr obj= svr=default: Unauthorized Request
qmgr obj= svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj= svr=default: Unauthorized Request

The server launched and I cannot stop it, nor can I issue any command related to torque (qterm, qmgr, qsub, etc.) under my current username or under root. Help!
Thank you,
-Aaron

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From atp42 at cornell.edu Tue Sep 27 10:58:50 2011
From: atp42 at cornell.edu (Aaron T Perry)
Date: Tue, 27 Sep 2011 12:58:50 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References:
Message-ID:

I think I have, I needed to create the file, and I was unsure about the formatting required.
This is what I have there.

# + + ubuntu atp42

Do I also need to create the nodes file in the torque>server_priv directory?

Thanks,
Aaron

On Tue, Sep 27, 2011 at 12:40 PM, Smith, Jerry Don II wrote:

> Have you set up hosts.equiv?
>
> see: http://www.clusterresources.com/torquedocs/1.3advconfig.shtml
From jdsmit at sandia.gov Tue Sep 27 11:04:41 2011
From: jdsmit at sandia.gov (Smith, Jerry Don II)
Date: Tue, 27 Sep 2011 17:04:41 +0000
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
Message-ID:

$PBS_HOME/server_priv/nodes needs to encompass your compute nodes

node1 np=4 # or however many cores you have
node2 np=4

Make sure that those nodes can be resolved via those names from the admin node.

Do you have $PBS_HOME/server_name file with the resolvable name of your admin server?

-Jerry
From mreppert at mit.edu Tue Sep 27 11:19:53 2011
From: mreppert at mit.edu (Mike Reppert)
Date: Tue, 27 Sep 2011 13:19:53 -0400
Subject: [torqueusers] Unauthorized Request
In-Reply-To: <1317064829.73299.YahooMailNeo@web125119.mail.ne1.yahoo.com>
References: <1317064829.73299.YahooMailNeo@web125119.mail.ne1.yahoo.com>
Message-ID:

I ran into exactly the same issue installing on Natty. The solution for me was to modify the /etc/hosts file from the Ubuntu default. I have pasted below some notes that I made on fixing the issues in our install.

Best,
Mike

The installation instructions at http://www.clusterresources.com/torquedocs21/1.1installation.shtml worked very well, up to a few glitches (which cost several days) described below: the first with recognizing the proper host address (fixed by editing the /etc/hosts file) and the second due to problems with password-less ssh for scp file transfer from the compute nodes back to the head (fixed using ssh-keygen and setting the proper permissions).

The online installation instructions say simply to go to the torque-2.4.8 directory and type (as sudo):

$ ./configure
$ make
$ make install

I found that in addition, before running Torque (probably better before installation), the file /etc/hosts needs to be modified from the Ubuntu default.
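The host-address glitch described in these notes can be checked mechanically before starting pbs_server. What follows is only a minimal sketch, not part of TORQUE: the function names are made up for illustration, the sample host names and addresses are illustrative, and the `127.0.1.1 headnode` line is an assumption about what a default-style file might contain. It parses hosts-format text and reports whether a host name maps only to loopback addresses.

```python
import ipaddress


def addresses_for(hosts_text, hostname):
    """Return the IP addresses that hosts-format text assigns to hostname."""
    found = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if hostname not in fields[1:]:  # fields[0] is the address
            continue
        try:
            found.append(ipaddress.ip_address(fields[0]))
        except ValueError:
            pass  # skip entries whose address cannot be parsed
    return found


def resolves_only_to_loopback(hosts_text, hostname):
    """True if hostname is listed and every address for it is loopback."""
    addrs = addresses_for(hosts_text, hostname)
    return bool(addrs) and all(a.is_loopback for a in addrs)


# Hypothetical sample data; "headnode" and the addresses are illustrative.
DEFAULT_STYLE = """127.0.0.1 localhost
127.0.1.1 headnode
"""
FIXED = """127.0.0.1 localhost
192.168.1.100 headnode
192.168.1.253 compute-0-0
192.168.1.254 compute-0-1
"""

print(resolves_only_to_loopback(DEFAULT_STYLE, "headnode"))  # True
print(resolves_only_to_loopback(FIXED, "headnode"))          # False
```

To check a real machine, one could pass the contents of /etc/hosts and the name stored in the server_name file; a result of True would suggest the server name still points at the machine's loopback address.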
In place of the first few lines (the IPv4 part) reading

127.0.0.1 localhost
127.0.1.1
< more stuff about IPv6 >

(the lines here are the Ubuntu 11.04 default) one should add the actual (static, internal) IP addresses of both the head and compute nodes:

127.0.0.1 localhost
< same stuff about IPv6 >

For example, if your head node domain name is headnode (local static IP 192.168.1.100) and you have two compute nodes named compute-0-0 and compute-0-1, the /etc/hosts might look like

127.0.0.1 localhost
192.168.1.100 headnode
192.168.1.253 compute-0-0
192.168.1.254 compute-0-1
< same stuff about IPv6 >

This is important so that the pbs_server recognizes the head node as the actual host -- otherwise, it will be confused and try doing things like communicating with "root@localhost" instead of "root@" (i.e. "root@headnode" in our example). The problem is that both 127.0.0.1 and 127.0.1.1 are IP addresses which point to the computer itself. One needs the actual static IP of headnode in the second line; otherwise either (1) the qmgr on the head node will not recognize the head node itself or (2) the compute nodes will try communicating with "127.0.0.1" (i.e. themselves) rather than with the head node.

On Mon, Sep 26, 2011 at 3:20 PM, Vt Vt wrote:

> Hi,
> I have been baffled by the error "Unauthorized Request" that I keep getting
> while installing torque. I tried several versions including 3.0.2 and some
> older versions.
>
> System: Ubuntu 11.04 Natty Narwhal
> machine type : it's a single cpu (12 core machine)
>
> I am trying to use just 8 cores for the setup. My questions are:
>
> (1) how to get rid of this error?
> server logs:
> PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ),
> aux=0, type=Manager, from root@localhost
>
> # qmgr -c 'p s '
> #
> # Set server attributes.
> #
> set server acl_hosts = XXXX
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
>
> # create queue batch
>
> qmgr obj=batch svr=default: Unauthorized Request
>
> Qmgr: list server managers
> Server XXXX.XX.XXXX.XXX (no output)
>
> Qmgr: set server managers+=root@XXXX
> qmgr obj= svr=default: Unauthorized Request
>
> trying qterm gives the same result.
>
> (2) how do I add a host to server_manager?
>
> (3) How do I completely uninstall torque when I install torque from a
> tarball using default parameters?
>
> (4) has anybody got detailed info on an Ubuntu 11.04 + torque install?

From atp42 at cornell.edu Tue Sep 27 11:23:28 2011
From: atp42 at cornell.edu (Aaron T Perry)
Date: Tue, 27 Sep 2011 13:23:28 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References:
Message-ID:

Yes, that I do have; that was the first thing I came across when looking through help online.

And I added the nodes file with appropriate settings for my machine, but I still get the same errors.

I have a completely unrelated question. I'm doing all this to run a model that I've been trying to port. I'm trying to figure out whether a segmentation fault I'm getting at runtime (using mpirun ./ccsm.exe) is due to a compiler error, or a stack/memory error (the code works on many other machines, though not necessarily with the compiler I'm using). If I can install torque I can use an automated script that also sets appropriate stack size, among other things. I am on 1 computer, with 1 node, and I have no desire to scale this instance of the model. Basically I'm wondering if you think there might be an easier/better alternative?

Thank you,
Aaron
From gus at ldeo.columbia.edu Tue Sep 27 12:10:32 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 27 Sep 2011 14:10:32 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References:
Message-ID: <4E821198.5090409@ldeo.columbia.edu>

Aaron

You can set the stack size unlimited in /etc/security/limits.conf (here along with locked memory and number of open files):

* - memlock -1
* - stack -1
* - nofile 4096

Granted, the above is RHEL/CentOS style; Debian/Ubuntu may differ or use a different file.
Also, you may want to check your /var/log/messages [or whatever Ubuntu uses for system logs] and see if it sheds more light on the pbs_server errors.
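For a single machine, the stack limit discussed above can also be inspected, and the soft limit raised up to the hard limit, for one process tree without touching system files. This is a minimal sketch under assumptions: it relies on Python's standard `resource` module on a Unix-like system, the function names are illustrative, and the launch line for the model is left as a comment because the actual command and arguments depend on the local setup.

```python
import resource


def describe_limit(value):
    """Render an rlimit value the way shell `ulimit` output reads."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def raise_soft_stack_limit():
    """Try to raise the soft stack limit to the hard limit.

    Returns the (soft, hard) pair in effect afterwards. If the system
    refuses the change, the current limits are returned unchanged.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    try:
        resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    except (ValueError, OSError):
        pass  # keep whatever limit is currently in effect
    return resource.getrlimit(resource.RLIMIT_STACK)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("stack soft/hard before:", describe_limit(soft), describe_limit(hard))
    soft, hard = raise_soft_stack_limit()
    print("stack soft/hard after: ", describe_limit(soft), describe_limit(hard))
    # Child processes started from here inherit the limit, e.g. (placeholder):
    # subprocess.run(["mpirun", "./ccsm.exe"])
```

If the hard limit itself is too low, an unprivileged process cannot go beyond it, and a system-level setting such as the limits.conf entries above would still be needed.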
My guess is that you need consistent server names in server_name,
server_priv/nodes [assuming your server is also a work
node running pbs_mom], and mom_priv/config (for $pbsserver).
My recollection is that these default to 'localhost' [and 127.0.0.1]
if your installation is on a *single standalone machine*,
but I am not sure.
You also need correct name resolution in /etc/hosts, as Mike Reppert
and Jerry Smith pointed out.

Also, unrelated, but you need to enable scheduling [after the
current problem is sorted out]:

qmgr -c 'set server scheduling = True'

Out of curiosity, is it a single machine or a small cluster?

I hope this helps,
Gus Correa

Aaron T Perry wrote:
> Yes, that I do i have, that was the first thing I came across when
> looking through help online.
>
> And I added the nodes file with appropriate settings for my machine, but
> I still get the same errors.
>
> I have a completely unrelated question. I'm doing all this to run a
> model that I've been trying to port. I'm trying to figure out whether a
> segmentation fault I'm getting at runtime (using mpirun ./ccsm.exe) is
> due to a compiler error, or a stack/memory error (the code works on many
> other machines, not necessarily the compiler I'm using though). If I can
> install torque I can use an automated script that also
> sets appropriate stack size, among other things. I am on 1 computer,
> with 1 node, and I have no desire to scale this instance of the model.
> Basically I'm wondering if you think there might be an easier/better
> alternative?
>
> Thank you,
> Aaron
>
>
> On Tue, Sep 27, 2011 at 1:04 PM, Smith, Jerry Don II
> wrote:
>
> $PBS_HOME/server_priv/nodes needs to encompass your compute nodes
>
> node1 np=4 # or however many cores you have
> node2 np=4
>
> Make sure that those nodes can be resolved via those names from the
> admin node.
>
> Do you have $PBS_HOME/server_name file with the resolvable name of
> your admin server?
> > -Jerry > > From: Aaron T Perry > > Reply-To: Torque Users Mailing List > > Date: Tue, 27 Sep 2011 12:58:50 -0400 > > To: Torque Users Mailing List > > Subject: Re: [torqueusers] Help: Unauthorized Request > > I think I have, I needed to create the file, and I was unsure about > the formatting required. > This is what I have there. > > # + + ubuntu atp42 > > Do I also need to create the nodes file in the torque>server_priv > directory? > > Thanks, > Aaron > > On Tue, Sep 27, 2011 at 12:40 PM, Smith, Jerry Don II > > wrote: > > Have you set up hosts.equiv? > > see: http://www.clusterresources.com/torquedocs/1.3advconfig.shtml > > > 1.3.2.1 Server Configuration Overview > > There are several steps to ensure that the server and the nodes > are completely aware of each other and able to communicate > directly. Some of this configuration takes place within TORQUE > directly using the *qmgr* command. Other configuration settings > are managed using the *pbs_server*nodes file, DNS files such as > /etc/hosts and the /etc/hosts.equiv file. > > > 1.3.2.2 Name Service Configuration > > Each node, as well as the server, must be able to resolve the > name of every node with which it will interact. This can be > accomplished using /etc/hosts, *DNS*, *NIS*, or other > mechanisms. In the case of /etc/hosts, the file can be shared > across systems in most cases. > > > -Jerry > > > From: Aaron T Perry > > Reply-To: Torque Users Mailing List > > > Date: Tue, 27 Sep 2011 12:33:31 -0400 > > To: Torque Users Mailing List > > Subject: Re: [torqueusers] Help: Unauthorized Request > > With the execption of the unauthorized request entries it looks > like almost everything is okay, execpt for the node file and > root localhost (this should be root ubuntu. > > Thank you for your help! > Aaron > > Here is an except from the server log... 
> > 09/27/2011 09:51:31;0002;PBS_Server;Svr;Log;Log opened > 09/27/2011 09:51:31;0006;PBS_Server;Svr;PBS_Server;Server ubuntu > started, initialization type = 4 > 09/27/2011 09:51:42;0002;PBS_Server;Svr;Log;Log opened > 09/27/2011 09:51:42;0006;PBS_Server;Svr;PBS_Server;Server ubuntu > started, initialization type = 4 > 09/27/2011 09:51:44;0002;PBS_Server;Svr;Act;Account file > /var/spool/torque/server_priv/accounting/20110927 opened > 09/27/2011 09:51:44;0040;PBS_Server;Req;setup_nodes;setup_nodes() > 09/27/2011 09:51:44;0004;PBS_Server;Svr;ubuntu;cannot open node > description file '/var/spool/torque/server_priv/nodes' in > setup_nodes() > 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, > recovered 0 queues > 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, > recovered 0 jobs > 09/27/2011 09:51:44;0006;PBS_Server;Svr;PBS_Server;Using ports > Server:15001 Scheduler:15004 MOM:15002 (server: 'ubuntu') > 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: > parent is exiting > 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: > parent is exiting > 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: > child process in background > 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Server Ready, > pid = 11995, loglevel=0 > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), 
aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply > code=15007(Unauthorized Request ), aux=0, type=Manager, from > root at localhost > 09/27/2011 09:51:49;0002;PBS_Server;Svr;PBS_Server;Torque Server > Version = 3.0.2, loglevel = 0 > 09/27/2011 09:56:49;0002;PBS_Server;Svr;PBS_Server;Torque Server > Version = 3.0.2, loglevel = 0 > ... > > On Tue, Sep 27, 2011 at 12:13 PM, Smith, Jerry Don II > > wrote: > > Are you seeing anything in the pbs_server logs? > > -Jerry > > From: Aaron T Perry > > Reply-To: Torque Users Mailing List > > > Date: Tue, 27 Sep 2011 09:53:28 -0400 > To: > > Subject: Re: [torqueusers] Help: Unauthorized Request > > Please, any help you can give would be greatly appreciated, > I'm completely stuck. All the solutions I found online have > failed. > > On Mon, Sep 26, 2011 at 2:35 PM, Aaron > wrote: > > Hi, > > I've just tried to install torque, and I ran the > following commands, > > ./configure > sudo make > sudo make install > > however when I run ./torque.setup username I get the > following... 
> > initializing TORQUE (admin: username at ubuntu) > PBS_Server ubuntu: Create mode and server database exists, > do you wish to continue y/(n)?y > Max open servers: 9 > qmgr obj= svr=default: Unauthorized Request > Max open servers: 9 > qmgr obj= svr=default: Unauthorized Request > qmgr obj= svr=default: Unauthorized Request > qmgr obj= svr=default: Unauthorized Request > qmgr obj= svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj=batch svr=default: Unauthorized Request > qmgr obj= svr=default: Unauthorized Request > > The server lanched and I cannot stop it, nor can issue > any command related to torque (qterm, gmgr, qsub, etc) > under my current username or under root. Help! > > Thank you, > -Aaron > > > _______________________________________________ torqueusers > mailing list torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ torqueusers > mailing list torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ torqueusers mailing > list torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > 
http://www.supercluster.org/mailman/listinfo/torqueusers
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From jjc at iastate.edu Tue Sep 27 12:45:02 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Tue, 27 Sep 2011 13:45:02 -0500
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References:
Message-ID:

For just one computer, write the following two script files (this assumes
you have 256GB of memory; modify as needed).

scr0:

#!/bin/bash
# Try successively larger stack limits, one run per limit.
for j in 1G 2G 4G 8G 16G 32G 64G 128G 256G ; do
  echo "Try $j "
  ./scr1 $j
done
exit

scr1:

#!/bin/csh -f
# Write a one-line machinefile naming this host, set the stack
# limit passed as $1, run the model, then clean up.
setenv F t1.$$
/bin/rm -f $F
hostname > $F
limit stacksize $1
mpirun -n 4 --machinefile $F ./ccsm.exe
/bin/rm -f $F
exit

Make both executable with

chmod u+x scr0 scr1

and then issue

./scr0

Modify the above procedure as needed.

If this is not just caused by a stack limit error, I'd look at either a
compiler optimization bug (recompile with -O0 and run) or, more likely,
a programming error (we all make them).

I'd recompile and check for bounds (e.g. -C on most Fortran compilers)
and uninitialized variables (-uvar on PathScale or Open64 compilers).
-rabc also works well on Cray compilers.

You can also use a parallel debugger like Totalview or DDT, or you can
use a run-time error detection tool like MPI-CHECK (Fortran only) or
Marmot. (See http://rted.public.iastate.edu/MPI/RESULTS/result_table.html
for the kinds of errors that these can catch.)
See http://rted.public.iastate.edu/Serial/RESULTS/result_table.html
for program errors other than those involving MPI routines. If you click
on items under the OS/Compiler/Runtime tool column, you can see the
suggested compiler options for best debugging for that compiler or tool.
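
The same stack-limit sweep can be sketched as a single bash script, for anyone who prefers not to mix bash and csh. This is only an illustration under stated assumptions: the function names run_with_stack and sweep_stack and the list of sizes are made up here, bash's ulimit -s takes the size in kB, and on a single machine the processes started by mpirun are expected to inherit the limit from the launching shell (worth confirming for your MPI).

```shell
#!/bin/bash
# Run a command once under the given soft stack limit (kB or "unlimited").
# The subshell keeps the changed limit from leaking into the caller.
# run_with_stack and sweep_stack are hypothetical helper names.
run_with_stack() {
  local limit=$1
  shift
  (
    # Exit with 125 if the limit itself cannot be set (e.g. above the hard limit).
    ulimit -S -s "$limit" || exit 125
    "$@"
  )
}

# Try successively larger limits; stop at the first run that exits with 0.
sweep_stack() {
  local kb
  for kb in 8192 65536 524288 4194304 unlimited; do
    echo "Try stack=$kb"
    if run_with_stack "$kb" "$@"; then
      echo "  succeeded"
      return 0
    else
      echo "  failed with exit status $?"
    fi
  done
  return 1
}

# Usage, with the command line from this thread:
#   sweep_stack mpirun -n 4 ./ccsm.exe
```

If the program fails the same way at every size, including unlimited, the stack limit is probably not the cause and the compiler checks described above are the next thing to try.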
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110927/c2950562/attachment-0001.html

From atp42 at cornell.edu Tue Sep 27 12:47:02 2011
From: atp42 at cornell.edu (Aaron T Perry)
Date: Tue, 27 Sep 2011 14:47:02 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To: <4E821198.5090409@ldeo.columbia.edu>
References: <4E821198.5090409@ldeo.columbia.edu>
Message-ID:

This is a single machine; it's a virtual machine running on my Windows 7
desktop. I'm trying your suggestion now.

Thanks,
Aaron

On Tue, Sep 27, 2011 at 2:10 PM, Gus Correa wrote:

> Aron
>
> You can set the stack size unlimited in /etc/security/limits.conf
> (here along with locked memory and number of open files):
>
> * - memlock -1
> * - stack -1
> * - nofile 4096
>
> Granted that the above is RHEL/CentOS style,
> Debian/Ubuntu may be different/different file.
>
> Also, you may want to check your /var/log/messages [or whatever Ubuntu
> uses for system logs] and see if it sheds more light into
> the pbs_server errors.
>
> My guess is that you need consistent server names in server_name,
> server_priv/nodes [assuming your server is also a work
> node running pbs_mom], mom_priv/config (for $pbsserver).
> My recollection is that these default to 'localhost' [and 127.0.0.1],
> if your installation is in a *single standalone machine*,
> but I am not sure.
> And you need right name resolution in /etc/hosts, as Mike Reppert
> and Jerry Smith pointed out.
>
> Also, not related, but you need to enable scheduling [after the
> current problem is sorted out]:
>
> qmgr -c 'set server scheduling = True'
>
> Out of curiosity, is it a single machine or a small cluster?
>
> I hope this helps,
> Gus Correa
>
> Aaron T Perry wrote:
> > Yes, that I do i have, that was the first thing I came across when
> > looking through help online.
> >
> > And I added the nodes file with appropriate settings for my machine, but
> > I still get the same errors.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110927/75862a55/attachment-0001.html

From atp42 at cornell.edu Tue Sep 27 12:55:28 2011
From: atp42 at cornell.edu (Aaron T Perry)
Date: Tue, 27 Sep 2011 14:55:28 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References:
Message-ID:

Thank you, I'll try these suggestions. I'm relatively new at this and
sometimes feel I'm in over my head. I'm almost certain this is a compiler or
stack limit error. I didn't write the code, and it's known to work on a
variety of systems, but only commercial compilers are officially supported
(I'm using gcc).

Thank you again,
Aaron

On Tue, Sep 27, 2011 at 2:45 PM, Coyle, James J [ITACD] wrote:

> For just one computer, write the following script files (this assumes you
> have 256GB of memory; modify as needed).
>
> scr0:
> #!/bin/bash
>
> for j in 1G 2G 4G 8G 16G 32G 64G 128G 256G ; do
> echo "Try $j "
> ./scr1 $j
> done
> exit
>
> scr1:
> #!/bin/csh -f
>
> setenv F t1.$$
> /bin/rm -f $F
> hostname > $F
> limit stacksize $1
> mpirun -n 4 --machinefile $F ./ccsm.exe
> /bin/rm -f $F
> exit
>
> Make both executable with
>
> chmod u+x scr0 scr1
>
> and then issue
>
> ./scr0
>
> Modify the above procedure as needed.
>
> If this is not just caused by a stack limit error, I'd look at either a
> compiler optimization bug (recompile with -O0 and run) or, more likely,
> a programming error (we all make them).
>
> I'd recompile and check for bounds (e.g. -C on most Fortran compilers),
> and uninitialized variables (-uvar on
> PathScale or Open64 compilers.
> -rabc also works well on Cray compilers.
>
> You can also use a parallel debugger like Totalview or DDT, or you can
> use a run-time error detection tool like MPI-CHECK (Fortran only) or
> Marmot. (See http://rted.public.iastate.edu/MPI/RESULTS/result_table.html
> for the kinds of errors that these can catch.) See
> http://rted.public.iastate.edu/Serial/RESULTS/result_table.html
> for program errors other than those involving MPI routines. If you click
> on items under the OS/Compiler/Runtime tool column, you can see the
> suggested compiler options for best debugging for that compiler or tool.
>
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Aaron T Perry
> Sent: Tuesday, September 27, 2011 12:23 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Help: Unauthorized Request
>
> Yes, I do have that; it was the first thing I came across when looking
> through help online.
>
> And I added the nodes file with appropriate settings for my machine, but I
> still get the same errors.
>
> I have a completely unrelated question. I'm doing all this to run a model
> that I've been trying to port. I'm trying to figure out whether a
> segmentation fault I'm getting at runtime (using mpirun ./ccsm.exe) is due
> to a compiler error, or a stack/memory error (the code works on many other
> machines, not necessarily the compiler I'm using though). If I can install
> torque I can use an automated script that also sets appropriate stack size,
> among other things. I am on 1 computer, with 1 node, and I have no desire to
> scale this instance of the model. Basically I'm wondering if you think there
> might be an easier/better alternative?
>
> Thank you,
> Aaron

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110927/8ff9cf8b/attachment-0001.html

From gus at ldeo.columbia.edu Tue Sep 27 13:01:38 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 27 Sep 2011 15:01:38 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References: <4E821198.5090409@ldeo.columbia.edu>
Message-ID: <4E821D92.3080502@ldeo.columbia.edu>

Now I wonder if part of the problem is due to it being a virtual machine.
- Does torque work in a virtual environment?
- How does MPI [whatever MPI you're using] behave [does it work? does it
  perform well?] in a virtual environment?
- Does something as big as ccsm [your ultimate goal apparently] work in a
  virtual environment?

Honestly, I don't really know.

For what it is worth, we run ccsm/cesm in a Linux cluster with Torque,
OpenMPI, etc. No virtualization, though.

Gus Correa

Aaron T Perry wrote:
> This is a single machine, it's a virtual machine running on my Windows 7
> desktop. Thanks, I'm trying your suggestion now.
>
> Thanks,
> Aaron
>
> On Tue, Sep 27, 2011 at 2:10 PM, Gus Correa wrote:
>
> Aaron
>
> You can set the stack size unlimited in /etc/security/limits.conf
> (here along with locked memory and number of open files):
>
> * - memlock -1
> * - stack -1
> * - nofile 4096
>
> Granted that the above is RHEL/CentOS style,
> Debian/Ubuntu may be different or use a different file.
>
> Also, you may want to check your /var/log/messages [or whatever Ubuntu
> uses for system logs] and see if it sheds more light on
> the pbs_server errors.
>
> My guess is that you need consistent server names in server_name,
> server_priv/nodes [assuming your server is also a work
> node running pbs_mom], mom_priv/config (for $pbsserver).
> My recollection is that these default to 'localhost' [and 127.0.0.1],
> if your installation is in a *single standalone machine*,
> but I am not sure.
> And you need correct name resolution in /etc/hosts, as Mike Reppert
> and Jerry Smith pointed out.
>
> Also, not related, but you need to enable scheduling [after the
> current problem is sorted out]:
>
> qmgr -c 'set server scheduling = True'
>
> Out of curiosity, is it a single machine or a small cluster?
>
> I hope this helps,
> Gus Correa
>
> Aaron T Perry wrote:
> > Yes, I do have that; it was the first thing I came across when
> > looking through help online.
> >
> > And I added the nodes file with appropriate settings for my machine,
> > but I still get the same errors.
> >
> > I have a completely unrelated question. I'm doing all this to run a
> > model that I've been trying to port. I'm trying to figure out whether a
> > segmentation fault I'm getting at runtime (using mpirun ./ccsm.exe) is
> > due to a compiler error, or a stack/memory error (the code works on many
> > other machines, not necessarily the compiler I'm using though). If I can
> > install torque I can use an automated script that also
> > sets appropriate stack size, among other things. I am on 1 computer,
> > with 1 node, and I have no desire to scale this instance of the model.
> > Basically I'm wondering if you think there might be an easier/better
> > alternative?
> >
> > Thank you,
> > Aaron

From jjc at iastate.edu Tue Sep 27 13:14:43 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Tue, 27 Sep 2011 14:14:43
-0500
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References:
Message-ID:

If you continue to have problems, I would check whether this is really a
problem with the virtual environment.

I would suggest creating a live USB stick with a Linux distribution on it.
I've used LinuxLive USB Creator: http://www.linuxliveusb.com/
to create a bootable USB thumb drive. I chose 8GB, and did not have to be
picky about the number of packages I use.

The installer runs on Windows, and will create a Linux distribution (I use
Fedora) on a new thumb drive. It has a GUI interface, and you don't need
special knowledge other than knowing about Unix. (The persistent image is
the portion that you can use to make updates to your install; I made that
2GB on my 8GB stick.)

Pick as many packages as you think you need, and they will be downloaded
and installed. If you need something later, you can use yum to install it.
E.g.

yum install gcc-gfortran boost boost-devel java*

If your computer's boot order is set to boot from a removable drive before
a hard disk, you can just reboot the computer and it should boot from USB.
If this is not set, you can interrupt the boot process when it says
something like BOOT ORDER F12 by pressing the F12 key, and then selecting
the USB drive.

You can run off the USB drive and it will not affect your machine at all;
just shut down, pull the USB stick, and power up, and you're back in
Windows again.

This could check whether the problem is with gcc or with the virtual
machine (did you make the virtual machine with enough memory?)

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Aaron T Perry
Sent: Tuesday, September 27, 2011 1:55 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Help: Unauthorized Request

Thank you, I'll try these suggestions.
I'm relatively new at this and sometimes feel I'm in over my head. I'm almost
certain this is a compiler or stack limit error. I didn't write the code, and
it's known to work on a variety of systems, but only commercial compilers are
officially supported (I'm using gcc).

Thank you again,
Aaron

On Tue, Sep 27, 2011 at 2:45 PM, Coyle, James J [ITACD] wrote:

For just one computer, write the following script files (this assumes you
have 256GB of memory; modify as needed).

scr0:
#!/bin/bash

for j in 1G 2G 4G 8G 16G 32G 64G 128G 256G ; do
echo "Try $j "
./scr1 $j
done
exit

scr1:
#!/bin/csh -f

setenv F t1.$$
/bin/rm -f $F
hostname > $F
limit stacksize $1
mpirun -n 4 --machinefile $F ./ccsm.exe
/bin/rm -f $F
exit

Make both executable with

chmod u+x scr0 scr1

and then issue

./scr0

Modify the above procedure as needed.

If this is not just caused by a stack limit error, I'd look at either a
compiler optimization bug (recompile with -O0 and run) or, more likely, a
programming error (we all make them).

I'd recompile and check for bounds (e.g. -C on most Fortran compilers), and
uninitialized variables (-uvar on PathScale or Open64 compilers. -rabc also
works well on Cray compilers.

You can also use a parallel debugger like Totalview or DDT, or you can use a
run-time error detection tool like MPI-CHECK (Fortran only) or Marmot. (See
http://rted.public.iastate.edu/MPI/RESULTS/result_table.html for the kinds of
errors that these can catch.) See
http://rted.public.iastate.edu/Serial/RESULTS/result_table.html for program
errors other than those involving MPI routines. If you click on items under
the OS/Compiler/Runtime tool column, you can see the suggested compiler
options for best debugging for that compiler or tool.
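
For readers who would rather not maintain two scripts, the stack-limit sweep described above can be sketched as a single Python file. This is only a rough sketch under stated assumptions, not what anyone in the thread actually ran: the helper names (parse_size, doubling_sizes, run_with_stack, sweep) are made up for illustration, it assumes a Linux/Unix host where Python's standard resource module is available, and it assumes the MPI ranks started by mpirun on the same host inherit the limit from their parent.

```python
import resource
import subprocess

# Multipliers for the size suffixes used in the thread (1G, 2G, ...).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '4G' or '512' into a number of bytes."""
    text = text.strip().upper()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def doubling_sizes(start, stop):
    """Return sizes in bytes from start to stop (inclusive), doubling each step."""
    sizes = []
    size = parse_size(start)
    limit = parse_size(stop)
    while size <= limit:
        sizes.append(size)
        size *= 2
    return sizes


def run_with_stack(command, stack_bytes):
    """Run command with its soft stack limit set to stack_bytes; return its exit code."""

    def set_limit():
        # Runs in the child just before exec. The soft limit may not exceed
        # the hard limit, so cap the request instead of failing.
        _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        wanted = stack_bytes
        if hard != resource.RLIM_INFINITY and wanted > hard:
            wanted = hard
        resource.setrlimit(resource.RLIMIT_STACK, (wanted, hard))

    return subprocess.run(command, preexec_fn=set_limit).returncode


def sweep(command, start="1G", stop="256G"):
    """Try command once per stack size; return {size_in_bytes: exit_code}."""
    return {size: run_with_stack(command, size) for size in doubling_sizes(start, stop)}
```

A call such as sweep(["mpirun", "-n", "4", "./ccsm.exe"]) would then try each size in turn; the machinefile step from scr1 is left out here on the assumption that everything runs on the local host. If the program crashes at every size, the stack limit is probably not the cause.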
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Aaron T Perry
Sent: Tuesday, September 27, 2011 12:23 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Help: Unauthorized Request

Yes, I do have that; it was the first thing I came across when looking
through help online.

And I added the nodes file with appropriate settings for my machine, but I
still get the same errors.

I have a completely unrelated question. I'm doing all this to run a model
that I've been trying to port. I'm trying to figure out whether a
segmentation fault I'm getting at runtime (using mpirun ./ccsm.exe) is due to
a compiler error, or a stack/memory error (the code works on many other
machines, not necessarily the compiler I'm using though). If I can install
torque I can use an automated script that also sets appropriate stack size,
among other things. I am on 1 computer, with 1 node, and I have no desire to
scale this instance of the model. Basically I'm wondering if you think there
might be an easier/better alternative?

Thank you,
Aaron

On Tue, Sep 27, 2011 at 1:04 PM, Smith, Jerry Don II wrote:

$PBS_HOME/server_priv/nodes needs to list your compute nodes:

node1 np=4 # or however many cores you have
node2 np=4

Make sure that those nodes can be resolved via those names from the admin
node.

Do you have a $PBS_HOME/server_name file with the resolvable name of your
admin server?

-Jerry

From: Aaron T Perry
Reply-To: Torque Users Mailing List
Date: Tue, 27 Sep 2011 12:58:50 -0400
To: Torque Users Mailing List
Subject: Re: [torqueusers] Help: Unauthorized Request

I think I have. I needed to create the file, and I was unsure about the
formatting required.
This is what I have there.

# + + ubuntu atp42

Do I also need to create the nodes file in the torque>server_priv directory?

Thanks,
Aaron

On Tue, Sep 27, 2011 at 12:40 PM, Smith, Jerry Don II wrote:

Have you set up hosts.equiv?
see: http://www.clusterresources.com/torquedocs/1.3advconfig.shtml

1.3.2.1 Server Configuration Overview

There are several steps to ensure that the server and the nodes are
completely aware of each other and able to communicate directly. Some of this
configuration takes place within TORQUE directly using the qmgr command.
Other configuration settings are managed using the pbs_server nodes file, DNS
files such as /etc/hosts and the /etc/hosts.equiv file.

1.3.2.2 Name Service Configuration

Each node, as well as the server, must be able to resolve the name of every
node with which it will interact. This can be accomplished using /etc/hosts,
DNS, NIS, or other mechanisms. In the case of /etc/hosts, the file can be
shared across systems in most cases.

-Jerry

From: Aaron T Perry
Reply-To: Torque Users Mailing List
Date: Tue, 27 Sep 2011 12:33:31 -0400
To: Torque Users Mailing List
Subject: Re: [torqueusers] Help: Unauthorized Request

With the exception of the unauthorized request entries it looks like almost
everything is okay, except for the node file and root localhost (this should
be root ubuntu).

Thank you for your help!
Aaron

Here is an excerpt from the server log...

09/27/2011 09:51:31;0002;PBS_Server;Svr;Log;Log opened
09/27/2011 09:51:31;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4
09/27/2011 09:51:42;0002;PBS_Server;Svr;Log;Log opened
09/27/2011 09:51:42;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4
09/27/2011 09:51:44;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20110927 opened
09/27/2011 09:51:44;0040;PBS_Server;Req;setup_nodes;setup_nodes()
09/27/2011 09:51:44;0004;PBS_Server;Svr;ubuntu;cannot open node description file '/var/spool/torque/server_priv/nodes' in setup_nodes()
09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues
09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
09/27/2011 09:51:44;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'ubuntu')
09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting
09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting
09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: child process in background
09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 11995, loglevel=0
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
09/27/2011 09:51:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
09/27/2011 09:56:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
...

On Tue, Sep 27, 2011 at 12:13 PM, Smith, Jerry Don II wrote:

Are you seeing anything in the pbs_server logs?

-Jerry

From: Aaron T Perry
Reply-To: Torque Users Mailing List
Date: Tue, 27 Sep 2011 09:53:28 -0400
To:
Subject: Re: [torqueusers] Help: Unauthorized Request

Please, any help you can give would be greatly appreciated, I'm completely
stuck. All the solutions I found online have failed.

On Mon, Sep 26, 2011 at 2:35 PM, Aaron wrote:

Hi,

I've just tried to install torque, and I ran the following commands,

./configure
sudo make
sudo make install

however when I run ./torque.setup username I get the following...

initializing TORQUE (admin: username at ubuntu)
PBS_Server ubuntu: Create mode and server database exists,
do you wish to continue y/(n)?y
Max open servers: 9
qmgr obj= svr=default: Unauthorized Request
Max open servers: 9
qmgr obj= svr=default: Unauthorized Request
qmgr obj= svr=default: Unauthorized Request
qmgr obj= svr=default: Unauthorized Request
qmgr obj= svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj=batch svr=default: Unauthorized Request
qmgr obj= svr=default: Unauthorized Request

The server launched and I cannot stop it, nor can I issue any command
related to torque (qterm, qmgr, qsub, etc) under my current username or
under root. Help!

Thank you,
-Aaron

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
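
The checklist from this thread (a server_priv/nodes entry, a server_name
file, and names that agree with each other) can be checked mechanically on a
single host. What follows is a rough bash sketch, not an official TORQUE
tool: the function name check_torque_names is made up for illustration, and
the /var/spool/torque path in the usage line is taken from the server log
quoted above, so adjust it if your PBS home is elsewhere.

```shell
#!/bin/bash
# check_torque_names PBS_HOME HOSTNAME
# Hypothetical helper: reports whether server_name and server_priv/nodes
# under PBS_HOME both mention HOSTNAME. Prints "ok" and returns 0 when
# they do; otherwise prints one line per problem and returns 1.
check_torque_names() {
    local pbs_home=$1 host=$2 status=0
    local server_name_file="$pbs_home/server_name"
    local nodes_file="$pbs_home/server_priv/nodes"

    # server_name should hold the resolvable name of the pbs_server host.
    if [ ! -r "$server_name_file" ]; then
        echo "missing: $server_name_file"
        status=1
    elif [ "$(head -n 1 "$server_name_file")" != "$host" ]; then
        echo "mismatch: server_name does not contain '$host'"
        status=1
    fi

    # On a single machine the nodes file needs a line whose first field
    # is this host, e.g. "ubuntu np=4".
    if [ ! -r "$nodes_file" ]; then
        echo "missing: $nodes_file"
        status=1
    elif ! awk -v h="$host" '$1 == h { found = 1 } END { exit !found }' "$nodes_file"; then
        echo "no entry for '$host' in $nodes_file"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "ok"
    return "$status"
}
```

Usage, typically as root because server_priv is usually not world-readable:
check_torque_names /var/spool/torque "$(hostname)". The sketch only compares
file contents; it does not test whether the name resolves through /etc/hosts
or DNS.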

From atp42 at cornell.edu Tue Sep 27 13:15:55 2011
From: atp42 at cornell.edu (Aaron T Perry)
Date: Tue, 27 Sep 2011 15:15:55 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To: <4E821D92.3080502@ldeo.columbia.edu>
References: <4E821198.5090409@ldeo.columbia.edu> <4E821D92.3080502@ldeo.columbia.edu>
Message-ID:

That might be part of the issue. I also just checked my system resources
and there is a lot more being used than I anticipated: I'm using 94% of
available memory, and most of the cores are operating at >60%.

I was working in a virtual machine to troubleshoot (I have multiple VMs
with different configurations, not running simultaneously). I'm using Open
MPI version 1.4.3, and I did test it with a very basic program to make sure
it compiles and runs properly (I had no errors, warnings, or other odd
behavior).

My ultimate goal is to move some of the fixes I found on the VM back to our
cluster, but our sysadmin isn't very familiar with torque either, so trying
to get it to work here was another one of my goals.

Are you using a commercial compiler to run the CESM? I was trying to get it
to work with gcc version 4.4.2, but was running into a multitude of
compilation errors.

I'll let you know if I get everything working on the virtual machine.

Thanks,
Aaron

On Tue, Sep 27, 2011 at 3:01 PM, Gus Correa wrote:

> Now I wonder if part of the problem is due to it being
> a virtual machine.
>
> - Does torque work in a virtual environment?
> - How does MPI [whatever MPI you're using] behave
> [works?, performs well?] in a virtual environment?
> - Does something as big as ccsm [your ultimate goal apparently]
> work in a virtual environment?
>
> Honestly, I don't really know.
>
> For what it is worth, we run ccsm/cesm in a Linux cluster with
> Torque, OpenMPI, etc.
> No virtualization, though.
>
> Gus Correa
>
> Aaron T Perry wrote:
> > This is a single machine, it's a virtual machine running on my Windows 7
> > desktop. Thanks, I'm trying your suggestion now.
> >
> > Thanks,
> > Aaron
> >
> > On Tue, Sep 27, 2011 at 2:10 PM, Gus Correa wrote:
> >
> > Aaron
> >
> > You can set the stack size unlimited in /etc/security/limits.conf
> > (here along with locked memory and number of open files):
> >
> > * - memlock -1
> > * - stack -1
> > * - nofile 4096
> >
> > Granted that the above is RHEL/CentOS style,
> > Debian/Ubuntu may be different/different file.
> >
> > Also, you may want to check your /var/log/messages [or whatever Ubuntu
> > uses for system logs] and see if it sheds more light into
> > the pbs_server errors.
> >
> > My guess is that you need consistent server names in server_name,
> > server_priv/nodes [assuming your server is also a work
> > node running pbs_mom], mom_priv/config (for $pbsserver).
> > My recollection is that these default to 'localhost' [and 127.0.0.1],
> > if your installation is in a *single standalone machine*,
> > but I am not sure.
> > And you need right name resolution in /etc/hosts, as Mike Reppert
> > and Jerry Smith pointed out.
> >
> > Also, not related, but you need to enable scheduling [after the
> > current problem is sorted out]:
> >
> > qmgr -c 'set server scheduling = True'
> >
> > Out of curiosity, is it a single machine or a small cluster?
> >
> > I hope this helps,
> > Gus Correa
> >
> > Aaron T Perry wrote:
> > > Yes, that I do have; that was the first thing I came across when
> > > looking through help online.
> > >
> > > And I added the nodes file with appropriate settings for my machine,
> > > but I still get the same errors.
> > >
> > > I have a completely unrelated question. I'm doing all this to run a
> > > model that I've been trying to port.
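
For a single machine, the stack limit can also be raised for one run
straight from the shell, without a batch system. This is a minimal bash
sketch under that assumption; run_with_stack is a hypothetical helper name,
not part of TORQUE or Open MPI. It changes only the soft limit, which cannot
rise above the hard limit that /etc/security/limits.conf (or the
distribution's equivalent) sets.

```shell
#!/bin/bash
# run_with_stack LIMIT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in a subshell whose soft stack limit
# is LIMIT (a size in KB, or the word "unlimited"). The limit of the
# calling shell is left unchanged.
run_with_stack() {
    local limit=$1
    shift
    (
        # -S touches the soft limit only; this fails if LIMIT exceeds
        # the hard limit, in which case the command is not started.
        ulimit -S -s "$limit" || exit 1
        exec "$@"
    )
}
```

Usage, with the command from this thread: run_with_stack unlimited mpirun
./ccsm.exe. On a single host the launched processes normally inherit the
limit from the shell that starts mpirun; across several nodes that is not
guaranteed, which is one reason the limits.conf route is suggested for
clusters.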

From atp42 at cornell.edu Tue Sep 27 13:18:57 2011
From: atp42 at cornell.edu (Aaron T Perry)
Date: Tue, 27 Sep 2011 15:18:57 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References:
Message-ID:

I gave the virtual machine half of the system memory, but given how much
RAM is being used that's probably not enough. I'll try this, thanks for the
suggestions!

-Aaron

On Tue, Sep 27, 2011 at 3:14 PM, Coyle, James J [ITACD] wrote:

> If you continue to have problems, I would check if this is really a
> problem with the Virtual environment.
>
> I would suggest creating a liveUSB stick with a linux distribution on
> it.
>
> I've used LinuxLive USB Creator: http://www.linuxliveusb.com/
> to create a bootable USB thumbdrive.
>
> I chose 8GB, and did not have to be picky about the number of packages I
> use.
>
> The installer runs on Windows, and will create a Linux distribution (I use
> fedora) on a new thumb drive.
> It has a GUI interface, and you don't need special knowledge other than
> knowing about Unix.
> (The persistent image is the portion that you can use to make updates to
> your install, I made that 2GB on my 8GB stick.)
>
> Pick as many packages as you think that you need, and they will be
> downloaded and installed. If you need something later, you can use yum
> to install it.
>
> E.g.
>
> yum install gcc-gfortran boost boost-devel java*
>
> If your computer's boot order is set to boot from a removable drive
> before a hard disk, you can just reboot the computer and it should boot
> from USB. If this is not set, you can interrupt the boot process when it
> says something like BOOT ORDER F12 by pressing the F12 key, and then
> selecting the USB drive.
>
> You can run off the USB drive and it will not affect your machine at all,
> just shutdown, pull the USB stick, and power up and you're back in
> Windows again.
>
> This could check whether the problem is with gcc or with the Virtual
> machine (did you make the virtual machine with enough memory?)
>
> James Coyle, PhD
> High Performance Computing Group
> Iowa State Univ.
> web: http://jjc.public.iastate.edu/
>
> *From:* torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] *On Behalf Of* Aaron T Perry
> *Sent:* Tuesday, September 27, 2011 1:55 PM
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Help: Unauthorized Request
>
> Thank you, I'll try these suggestions. I'm relatively new at this and
> sometimes feel I'm in over my head.
>
> I'm almost certain this is a compiler or stack limit error. I didn't write
> the code, and it's known to work on a variety of systems but
> only commercial compilers are officially supported (I'm using gcc).
>
> Thank you again,
> Aaron
>
> On Tue, Sep 27, 2011 at 2:45 PM, Coyle, James J [ITACD] wrote:
>
> For just one computer, I would write the following script files (assumes
> you have 256GB of memory, modify as needed.)
>
> scr0:
> #!/bin/bash
>
> for j in 1G 2G 4G 8G 16G 32G 64G 128G 256G ; do
>     echo "Try $j "
>     ./scr1 $j
> done
> exit
>
> scr1:
> #!/bin/csh -f
>
> setenv F t1.$$
> /bin/rm -f $F
> hostname > $F
> limit stacksize $1
> mpirun -n 4 -machinefile $F ./ccsm.exe
> /bin/rm -f $F
> exit
>
> make both executable with
> chmod u+x scr0 scr1
>
> and then issue
>
> ./scr0
>
> Modify the above procedure as needed.
>
> If this is not just caused by a stack limit error, I'd look at either a
> compiler optimization bug (recompile with -O0 and run) or more likely a
> programming error (we all make them.)
>
> I'd recompile and check for bounds (e.g. -C on most Fortran compilers),
> and uninitialized variables (-uvar on PathScale or Open64 compilers;
> -rabc also works well on Cray Compilers).
>
> You can also use a parallel debugger like Totalview or DDT, or you can
> use a run-time error detection tool like MPI-CHECK (Fortran only) or
> Marmot. (See http://rted.public.iastate.edu/MPI/RESULTS/result_table.html
> for the kinds of errors that these can catch.) See
> http://rted.public.iastate.edu/Serial/RESULTS/result_table.html
> for program errors other than those involving MPI routines. If you click
> on items under the OS/Compiler/Runtime tool column, you can see the
> suggested compiler options for best debugging for that Compiler or tool.
>
> *From:* torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] *On Behalf Of* Aaron T Perry
> *Sent:* Tuesday, September 27, 2011 12:23 PM
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Help: Unauthorized Request
>
> Yes, that I do have; that was the first thing I came across when looking
> through help online.
>
> And I added the nodes file with appropriate settings for my machine, but I
> still get the same errors.
>
> I have a completely unrelated question. I'm doing all this to run a model
> that I've been trying to port. I'm trying to figure out whether a
> segmentation fault I'm getting at runtime (using mpirun ./ccsm.exe) is due
> to a compiler error, or a stack/memory error (the code works on many other
> machines, not necessarily the compiler I'm using though). If I can install
> torque I can use an automated script that also sets appropriate stack size,
> among other things. I am on 1 computer, with 1 node, and I have no desire to
> scale this instance of the model. Basically I'm wondering if you think there
> might be an easier/better alternative?
>
> Thank you,
> Aaron
>
> On Tue, Sep 27, 2011 at 1:04 PM, Smith, Jerry Don II wrote:
>
> $PBS_HOME/server_priv/nodes needs to encompass your compute nodes
>
> node1 np=4 # or however many cores you have
> node2 np=4
>
> Make sure that those nodes can be resolved via those names from the admin
> node.
>
> Do you have $PBS_HOME/server_name file with the resolvable name of your
> admin server?
>
> -Jerry
>
> *From:* Aaron T Perry
> *Reply-To:* Torque Users Mailing List
> *Date:* Tue, 27 Sep 2011 12:58:50 -0400
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Help: Unauthorized Request
>
> I think I have, I needed to create the file, and I was unsure about the
> formatting required.
> This is what I have there.
>
> # + + ubuntu atp42
>
> Do I also need to create the nodes file in the torque>server_priv
> directory?
>
> Thanks,
> Aaron
>
> On Tue, Sep 27, 2011 at 12:40 PM, Smith, Jerry Don II wrote:
>
> Have you set up hosts.equiv?
>
> see: http://www.clusterresources.com/torquedocs/1.3advconfig.shtml
>
> 1.3.2.1 Server Configuration Overview
>
> There are several steps to ensure that the server and the nodes are
> completely aware of each other and able to communicate directly. Some of
> this configuration takes place within TORQUE directly using the *qmgr*
> command. Other configuration settings are managed using the
> *pbs_server* nodes file, DNS files such as /etc/hosts and the
> /etc/hosts.equiv file.
>
> 1.3.2.2 Name Service Configuration
>
> Each node, as well as the server, must be able to resolve the name of every
> node with which it will interact. This can be accomplished using
> /etc/hosts, *DNS*, *NIS*, or other mechanisms. In the case of /etc/hosts,
> the file can be shared across systems in most cases.
>
> -Jerry
>
> *From:* Aaron T Perry
> *Reply-To:* Torque Users Mailing List
> *Date:* Tue, 27 Sep 2011 12:33:31 -0400
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Help: Unauthorized Request
>
> With the exception of the unauthorized request entries it looks like almost
> everything is okay, except for the node file and root localhost (this
> should be root ubuntu).
>
> Thank you for your help!
> Aaron
>
> Here is an excerpt from the server log...
>
> 09/27/2011 09:51:31;0002;PBS_Server;Svr;Log;Log opened
> 09/27/2011 09:51:31;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4
> 09/27/2011 09:51:42;0002;PBS_Server;Svr;Log;Log opened
> 09/27/2011 09:51:42;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4
> 09/27/2011 09:51:44;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20110927 opened
> 09/27/2011 09:51:44;0040;PBS_Server;Req;setup_nodes;setup_nodes()
> 09/27/2011 09:51:44;0004;PBS_Server;Svr;ubuntu;cannot open node description file '/var/spool/torque/server_priv/nodes' in setup_nodes()
> 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues
> 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> 09/27/2011 09:51:44;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'ubuntu')
> 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting
> 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting
> 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: child process in background
> 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 11995, loglevel=0
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> 09/27/2011 09:51:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
> 09/27/2011 09:56:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
> ...
>
> On Tue, Sep 27, 2011 at 12:13 PM, Smith, Jerry Don II wrote:
>
> Are you seeing anything in the pbs_server logs?
>
> -Jerry
>
> *From:* Aaron T Perry
> *Reply-To:* Torque Users Mailing List
> *Date:* Tue, 27 Sep 2011 09:53:28 -0400
> *To:*
> *Subject:* Re: [torqueusers] Help: Unauthorized Request
>
> Please, any help you can give would be greatly appreciated, I'm completely
> stuck. All the solutions I found online have failed.
>
> On Mon, Sep 26, 2011 at 2:35 PM, Aaron wrote:
>
> Hi,
>
> I've just tried to install torque, and I ran the following commands,
>
> ./configure
> sudo make
> sudo make install
>
> however when I run ./torque.setup username I get the following...
>
> initializing TORQUE (admin: username at ubuntu)
> PBS_Server ubuntu: Create mode and server database exists,
> do you wish to continue y/(n)?y
> Max open servers: 9
> qmgr obj= svr=default: Unauthorized Request
> Max open servers: 9
> qmgr obj= svr=default: Unauthorized Request
> qmgr obj= svr=default: Unauthorized Request
> qmgr obj= svr=default: Unauthorized Request
> qmgr obj= svr=default: Unauthorized Request
> qmgr obj=batch svr=default: Unauthorized Request
> qmgr obj=batch svr=default: Unauthorized Request
> qmgr obj=batch svr=default: Unauthorized Request
> qmgr obj=batch svr=default: Unauthorized Request
> qmgr obj=batch svr=default: Unauthorized Request
> qmgr obj=batch svr=default: Unauthorized Request
> qmgr obj= svr=default: Unauthorized Request
>
> The server launched and I cannot stop it, nor can I issue any command related
> to
torque (qterm, gmgr, qsub, etc) under my current username or under root. > Help!**** > > **** > > Thank you,**** > > -Aaron**** > > **** > > **** > > _______________________________________________ torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers **** > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers**** > > **** > > _______________________________________________ torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers **** > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers**** > > **** > > _______________________________________________ torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers **** > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers**** > > **** > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers**** > > ** ** > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... 
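
The detail worth pulling out of the log excerpt quoted above is the requester on each rejected line: it is root at localhost (the archive rewrites "@" as " at "), while the server calls itself ubuntu. What follows is only a rough sketch for tallying rejected requesters from a server log. The function name rejected_requesters and the sample file are made up for illustration, and the line layout is assumed from the excerpt rather than taken from TORQUE documentation.

```shell
#!/bin/sh
# Tally who sent the requests that pbs_server rejected as unauthorized.
# Assumption: rejected lines contain "Unauthorized Request" and end with
# "from user@host", as in the excerpt quoted above.
rejected_requesters() {
    grep 'Unauthorized Request' "$1" |
        sed -n 's/.*from \([^ ]*@[^ ]*\).*/\1/p' |
        sort | uniq -c | sed 's/^ *//'
}

# Self-contained demonstration on a small sample file (not a real log).
sample=$(mktemp)
cat > "$sample" <<'EOF'
09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 11995, loglevel=0
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root@localhost
EOF
rejected_requesters "$sample"
rm -f "$sample"
```

If every rejected request comes from a host name other than the one the server reports for itself, that points at name resolution (for example /etc/hosts) rather than at the qmgr commands themselves.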
From gus at ldeo.columbia.edu Tue Sep 27 13:41:54 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 27 Sep 2011 15:41:54 -0400
Subject: [torqueusers] Help: Unauthorized Request
In-Reply-To:
References: <4E821198.5090409@ldeo.columbia.edu> <4E821D92.3080502@ldeo.columbia.edu>
Message-ID: <4E822702.7000204@ldeo.columbia.edu>

Aaron T Perry wrote:
> That might be part of the issue. I also just checked my system resources
> and there is a lot more being used than I anticipated; I'm using 94% of
> available memory, and most of the cores are operating at >60%. I was
> working in a virtual machine to troubleshoot (I have multiple VMs with
> different configurations, not running simultaneously).
>

If this is while running ccsm, you should expect near 100% core use.
As for RAM, it depends on how much you have and which ccsm configuration
you are trying to run.
Ccsm uses a lot of resources, even on small configurations.
If you are paging memory, that is not a good sign,
and may well explain a segfault.
Anyway, this is more of a NCAR CGD Bulletin Board discussion than
a Torque discussion.

> I'm using Open MPI version 1.4.3, and I did test it with a very basic
> program to make sure it compiles and runs properly (I had no errors,
> warnings, or other odd behavior).

You're right.
Testing with the basic programs [connectivity_c.c, ring_c.c, hello_c.c]
is the right thing to do.

>
> My ultimate goal is to move some of the fixes I found on the VM back to
> our cluster, but our sysadmin isn't very familiar with torque either, so
> trying to get it to work here was another one of my goals.
>

My guess is that your development platform [VM] is more
complicated than your production one [cluster].
Have you tried the cluster directly?
Does it have torque installed?
You could either ask the sysadmin to install torque [if not available]
and OpenMPI --with-tm support.
Or, if torque is not available and there is no resource manager,
just install OpenMPI in your home directory
[assuming it is NFS mounted on the compute nodes]
and run ccsm straight from mpiexec.

> Are you using a commercial compiler to run the CESM? I was trying to get
> it to work with gcc version 4.4.2, but was running into a multitude of
> compilation errors.
>

I haven't tried it with gcc/gfortran, but it may work.
I have been using Intel compilers icc/ifort and OpenMPI 1.4.3 also.
There are several road blocks to ccsm/cesm.
The compiler may be the lesser of them.

Gus Correa

> I'll let you know if I get everything working on the virtual machine.
>
> Thanks,
> Aaron
>
> On Tue, Sep 27, 2011 at 3:01 PM, Gus Correa wrote:
>
> Now I wonder if part of the problem is due to it being
> a virtual machine.
>
> - Does torque work in a virtual environment?
> - How does MPI [whatever MPI you're using] behave
> [works?, performs well?] in a virtual environment?
> - Does something as big as ccsm [your ultimate goal apparently]
> work in a virtual environment?
>
> Honestly, I don't really know.
>
> For what it is worth, we run ccsm/cesm in a Linux cluster with
> Torque, OpenMPI, etc.
> No virtualization, though.
>
> Gus Correa
>
> Aaron T Perry wrote:
> > This is a single machine, it's a virtual machine running on my Windows 7
> > desktop. Thanks, I'm trying your suggestion now.
> >
> > Thanks,
> > Aaron
> >
> > On Tue, Sep 27, 2011 at 2:10 PM, Gus Correa wrote:
> >
> > Aaron
> >
> > You can set the stack size unlimited in /etc/security/limits.conf
> > (here along with locked memory and number of open files):
> >
> > * - memlock -1
> > * - stack -1
> > * - nofile 4096
> >
> > Granted that the above is RHEL/CentOS style,
> > Debian/Ubuntu may be different/different file.
> >
> > Also, you may want to check your /var/log/messages [or whatever Ubuntu
> > uses for system logs] and see if it sheds more light on
> > the pbs_server errors.
> >
> > My guess is that you need consistent server names in server_name,
> > server_priv/nodes [assuming your server is also a work
> > node running pbs_mom], mom_priv/config (for $pbsserver).
> > My recollection is that these default to 'localhost' [and 127.0.0.1],
> > if your installation is in a *single standalone machine*,
> > but I am not sure.
> > And you need right name resolution in /etc/hosts, as Mike Reppert
> > and Jerry Smith pointed out.
> >
> > Also, not related, but you need to enable scheduling [after the
> > current problem is sorted out]:
> >
> > qmgr -c 'set server scheduling = True'
> >
> > Out of curiosity, is it a single machine or a small cluster?
> >
> > I hope this helps,
> > Gus Correa
> >
> > Aaron T Perry wrote:
> > > Yes, that I do have, that was the first thing I came across when
> > > looking through help online.
> > >
> > > And I added the nodes file with appropriate settings for my machine, but
> > > I still get the same errors.
> > >
> > > I have a completely unrelated question. I'm doing all this to run a
> > > model that I've been trying to port. I'm trying to figure out whether a
> > > segmentation fault I'm getting at runtime (using mpirun ./ccsm.exe) is
> > > due to a compiler error, or a stack/memory error (the code works on many
> > > other machines, not necessarily the compiler I'm using though). If I can
> > > install torque I can use an automated script that also
> > > sets appropriate stack size, among other things. I am on 1 computer,
> > > with 1 node, and I have no desire to scale this instance of the model.
> > > Basically I'm wondering if you think there might be an easier/better
> > > alternative?
> > >
> > > Thank you,
> > > Aaron
> > >
> > > On Tue, Sep 27, 2011 at 1:04 PM, Smith, Jerry Don II wrote:
> > >
> > > $PBS_HOME/server_priv/nodes needs to encompass your compute nodes
> > >
> > > node1 np=4 # or however many cores you have
> > > node2 np=4
> > >
> > > Make sure that those nodes can be resolved via those names from the
> > > admin node.
> > >
> > > Do you have $PBS_HOME/server_name file with the resolvable name of
> > > your admin server?
> > >
> > > -Jerry
> > >
> > > From: Aaron T Perry
> > > Reply-To: Torque Users Mailing List
> > > Date: Tue, 27 Sep 2011 12:58:50 -0400
> > > To: Torque Users Mailing List
> > > Subject: Re: [torqueusers] Help: Unauthorized Request
> > >
> > > I think I have, I needed to create the file, and I was unsure about
> > > the formatting required.
> > > This is what I have there.
> > >
> > > # + + ubuntu atp42
> > >
> > > Do I also need to create the nodes file in the torque>server_priv
> > > directory?
> > >
> > > Thanks,
> > > Aaron
> > >
> > > On Tue, Sep 27, 2011 at 12:40 PM, Smith, Jerry Don II wrote:
> > >
> > > Have you set up hosts.equiv?
> > >
> > > see: http://www.clusterresources.com/torquedocs/1.3advconfig.shtml
> > >
> > > 1.3.2.1 Server Configuration Overview
> > >
> > > There are several steps to ensure that the server and the nodes
> > > are completely aware of each other and able to communicate
> > > directly. Some of this configuration takes place within TORQUE
> > > directly using the *qmgr* command. Other configuration settings
> > > are managed using the *pbs_server* nodes file, DNS files such as
> > > /etc/hosts and the /etc/hosts.equiv file.
> > >
> > > 1.3.2.2 Name Service Configuration
> > >
> > > Each node, as well as the server, must be able to resolve the
> > > name of every node with which it will interact. This can be
> > > accomplished using /etc/hosts, *DNS*, *NIS*, or other
> > > mechanisms. In the case of /etc/hosts, the file can be shared
> > > across systems in most cases.
> > >
> > > -Jerry
> > >
> > > From: Aaron T Perry
> > > Reply-To: Torque Users Mailing List
> > > Date: Tue, 27 Sep 2011 12:33:31 -0400
> > > To: Torque Users Mailing List
> > > Subject: Re: [torqueusers] Help: Unauthorized Request
> > >
> > > With the exception of the unauthorized request entries it looks
> > > like almost everything is okay, except for the node file and
> > > root localhost (this should be root ubuntu).
> > >
> > > Thank you for your help!
> > > Aaron
> > >
> > > Here is an excerpt from the server log...
> > >
> > > 09/27/2011 09:51:31;0002;PBS_Server;Svr;Log;Log opened
> > > 09/27/2011 09:51:31;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4
> > > 09/27/2011 09:51:42;0002;PBS_Server;Svr;Log;Log opened
> > > 09/27/2011 09:51:42;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, initialization type = 4
> > > 09/27/2011 09:51:44;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20110927 opened
> > > 09/27/2011 09:51:44;0040;PBS_Server;Req;setup_nodes;setup_nodes()
> > > 09/27/2011 09:51:44;0004;PBS_Server;Svr;ubuntu;cannot open node description file '/var/spool/torque/server_priv/nodes' in setup_nodes()
> > > 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 queues
> > > 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> > > 09/27/2011 09:51:44;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'ubuntu')
> > > 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting
> > > 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent is exiting
> > > 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: child process in background
> > > 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 11995, loglevel=0
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
> > > 09/27/2011 09:51:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
> > > 09/27/2011 09:56:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
> > > ...
> > >
> > > On Tue, Sep 27, 2011 at 12:13 PM, Smith, Jerry Don II wrote:
> > >
> > > Are you seeing anything in the pbs_server logs?
> > >
> > > -Jerry
> > >
> > > From: Aaron T Perry
> > > Reply-To: Torque Users Mailing List
> > > Date: Tue, 27 Sep 2011 09:53:28 -0400
> > > To:
> > > Subject: Re: [torqueusers] Help: Unauthorized Request
> > >
> > > Please, any help you can give would be greatly appreciated,
> > > I'm completely stuck. All the solutions I found online have
> > > failed.
> > >
> > > On Mon, Sep 26, 2011 at 2:35 PM, Aaron wrote:
> > >
> > > Hi,
> > >
> > > I've just tried to install torque, and I ran the
> > > following commands,
> > >
> > > ./configure
> > > sudo make
> > > sudo make install
> > >
> > > however when I run ./torque.setup username I get the
> > > following...
> > >
> > > initializing TORQUE (admin: username at ubuntu)
> > > PBS_Server ubuntu: Create mode and server database exists,
> > > do you wish to continue y/(n)?y
> > > Max open servers: 9
> > > qmgr obj= svr=default: Unauthorized Request
> > > Max open servers: 9
> > > qmgr obj= svr=default: Unauthorized Request
> > > qmgr obj= svr=default: Unauthorized Request
> > > qmgr obj= svr=default: Unauthorized Request
> > > qmgr obj= svr=default: Unauthorized Request
> > > qmgr obj=batch svr=default: Unauthorized Request
> > > qmgr obj=batch svr=default: Unauthorized Request
> > > qmgr obj=batch svr=default: Unauthorized Request
> > > qmgr obj=batch svr=default: Unauthorized Request
> > > qmgr obj=batch svr=default: Unauthorized Request
> > > qmgr obj=batch svr=default: Unauthorized Request
> > > qmgr obj= svr=default: Unauthorized Request
> > >
> > > The server launched and I cannot stop it, nor can I issue
> > > any command related to torque (qterm, qmgr, qsub, etc)
> > > under my current username or under root. Help!
> > >
> > > Thank you,
> > > -Aaron
> > >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From d-gitelman at northwestern.edu Tue Sep 27 12:23:47 2011
From: d-gitelman at northwestern.edu (Darren Gitelman)
Date: Tue, 27 Sep 2011 13:23:47 -0500
Subject: [torqueusers] using dependencies and arrays
Message-ID:

I am having a problem with dependencies and job arrays. I've seen several
messages on the list about this but no resolution.

We are using torque 2.5.8

I first submit several jobs in an array:

qsub -l -t 1-3

This returns jobid1[]

If I call the job with

qsub -l depend:afteranyarray:jobid1[]

Then the second job doesn't wait until the job array (jobid1[]) has
completed. It starts up about 20 seconds after the various jobs in the job
array are started and of course fails since the results of jobid1 aren't
ready yet.

I've also tried using depend:afterokarray, depend:afterok and
depend:afterany.

I've also tried submitting the second job with:

qsub -W depend:afteranyarray:jobid1[]

(as well as the same permutations as above). In this case the second job
does hold... forever.

When I run checkjob on each job in the array I find they have all completed
successfully with an exit status of 0.

When I checkjob the held job I get

[xxx at quser04 ~]$ checkjob -vvv 1183767
job 1183767 (RM job '1183767.qsched01')

AName: xxx.defragment
State: Hold
Creds: user:xxx group:xxx account:t20213 class:short
WallTime: 00:00:00 of 3:58:20
SubmitTime: Tue Sep 27 11:33:33
(Time Queued Total: 1:45:28 Eligible: 00:00:05)

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1

Req[0] TaskCount: 1 Partition: ALL
NodeCount: 1

IWD: /home/xxx
UMask: 0000
OutputFile: quser04:/home/xxx/./xxx_logs/xxx.defragment.o1183767
ErrorFile: quser04:/home/xxx/./xxx_logs/xxx.defragment.e1183767
Partition List: quest1,quest2,questgpu1,SHARED
SrcRM: torque DstRM: torque DstRMJID: 1183767.qsched01
Submit Args: -V -d . -r y -q short -M d-xxx at xxx.edu -N xxx.defragment -m abe -o ./xxx_logs/ -e ./xxx_logs/ -l walltime=14300 -W depend=afteranyarray:1183766[] /home/xxx/tempcmd20332
Flags: RESTARTABLE
Attr: checkpoint
StartPriority: 256
PE: 1.00
NOTE: job cannot run (job has hold in place)
NOTE: job violates constraints for partition hyperthread (non-idle state 'Hold')
NOTE: job violates constraints for partition quest1 (non-idle state 'Hold')
NOTE: job violates constraints for partition quest2 (non-idle state 'Hold')
NOTE: job violates constraints for partition questgpu1 (non-idle state 'Hold')
NOTE: job violates constraints for partition pim (non-idle state 'Hold')

BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)

I don't really understand what constraints the job is violating and why the
dependency isn't working with either -l or -W.

Thanks
Darren
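
For readers following the message above: the "Submit Args" line in the checkjob output spells the attribute as depend=afteranyarray:1183766[] and passes it with -W, whereas the prose writes it with a colon after "depend". The sketch below only illustrates that spelling. The helper name depend_attr and the script names in the comments are hypothetical, and nothing in the thread establishes that this form clears the hold described above.

```shell
#!/bin/sh
# Build the value for qsub's -W option from a dependency type and a job id.
# Assumption: the "depend=<type>:<jobid>" spelling is copied from the
# "Submit Args" line quoted above; it is not checked against qsub here.
depend_attr() {
    printf 'depend=%s:%s\n' "$1" "$2"
}

# Possible use on a cluster where qsub exists (script names are hypothetical):
#   first=$(qsub -t 1-3 array_step.sh)       # prints the array's job id
#   qsub -W "$(depend_attr afteranyarray "$first")" collect.sh

# Quote the job id so the shell does not treat "[]" as a glob pattern.
depend_attr afteranyarray '1183766[]'
```

The quoting matters more than the helper itself: an unquoted jobid[] can be altered by the shell before qsub ever sees it, depending on the shell and its globbing options.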
From Gareth.Williams at csiro.au Tue Sep 27 15:23:10 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Wed, 28 Sep 2011 07:23:10 +1000
Subject: [torqueusers] Job arrays don't show up in pbstop output
In-Reply-To: <601D8486-C76B-44DF-8959-E23AFEB3F2BF@nyu.edu>
References: <601D8486-C76B-44DF-8959-E23AFEB3F2BF@nyu.edu>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE3F@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Sreedhar Manchu [mailto:sm4082 at nyu.edu]
> Sent: Wednesday, 28 September 2011 1:33 AM
> To: Torque Users Mailing List
> Subject: [torqueusers] Job arrays don't show up in pbstop output
>
> Hi,
>
> Since we upgraded to torque 2.5.8, job arrays are not showing up in
> pbstop output. I don't have that much expertise in perl to change the
> pbstop code. If anyone has fixed this problem or could give me pointers
> on fixing it, I would greatly appreciate it.
>
> I have looked for the latest version of pbstop but it looks like I do have
> the latest one (pbstop-4.16-10.el5 and perl-PBS-0.33-10.el5).
>
> Thanks in advance,
> Sreedhar.

We made changes to pbstop some time ago for array jobs and then a more recent round of changes for the new job array display notation. The version probably does not work for array jobs with the older notation anymore.

I'd be happy to provide it but would rather it were blessed by Garrick. It seems his USC perl-PBS site has been disabled for some time. Perhaps getting the actual code into the contrib directory of torque would be possible.

Garrick, are you still on this list?

Gareth

From sm4082 at nyu.edu Tue Sep 27 20:22:56 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Tue, 27 Sep 2011 22:22:56 -0400
Subject: [torqueusers] Job arrays don't show up in pbstop output
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE3F@exvic-mbx04.nexus.csiro.au>
References: <601D8486-C76B-44DF-8959-E23AFEB3F2BF@nyu.edu> <007DECE986B47F4EABF823C1FBB19C620102B6D6AE3F@exvic-mbx04.nexus.csiro.au>
Message-ID: <31F3AEBB-39D1-4623-95EE-CD4E49A2EAEE@nyu.edu>

Hi Gareth,

Thanks a lot. It would be really great if you or Garrick could provide this. I would really appreciate it. Hopefully, Garrick writes.

Overall, providing it through torque would be very helpful to many users. It has a very old version right now.

Thanks
Sreedhar.

Sent from my phone.

On Sep 27, 2011, at 17:23, wrote:

>> -----Original Message-----
>> From: Sreedhar Manchu [mailto:sm4082 at nyu.edu]
>> Sent: Wednesday, 28 September 2011 1:33 AM
>> To: Torque Users Mailing List
>> Subject: [torqueusers] Job arrays don't show up in pbstop output
>>
>> Hi,
>>
>> Since we upgraded to torque 2.5.8, job arrays are not showing up in
>> pbstop output. I don't have that much expertise in perl to change the
>> pbstop code. If anyone has fixed this problem or could give me pointers
>> on fixing it, I would greatly appreciate it.
>>
>> I have looked for the latest version of pbstop but it looks like I do have
>> the latest one (pbstop-4.16-10.el5 and perl-PBS-0.33-10.el5).
>>
>> Thanks in advance,
>> Sreedhar.
>
> We made changes to pbstop some time ago for array jobs and then a more recent round of changes for the new job array display notation. The version probably does not work for array jobs with the older notation anymore.
>
> I'd be happy to provide it but would rather it were blessed by Garrick. It seems his USC perl-PBS site has been disabled for some time. Perhaps getting the actual code into the contrib directory of torque would be possible.
>
> Garrick, are you still on this list?
>
> Gareth
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From vt7500 at yahoo.com Tue Sep 27 21:26:50 2011
From: vt7500 at yahoo.com (Vt Vt)
Date: Tue, 27 Sep 2011 20:26:50 -0700 (PDT)
Subject: [torqueusers] Unauthorized Request
In-Reply-To:
References: <1317064829.73299.YahooMailNeo@web125119.mail.ne1.yahoo.com>
Message-ID: <1317180410.23236.YahooMailNeo@web125109.mail.ne1.yahoo.com>

Thanks for the suggestions. They worked. I have already submitted a few jobs and they are running.

According to your suggestion, here is what I did [I have a single machine]:

# contents of my /etc/hosts file
127.0.0.1 localhost
# 127.0.1.1 XXXX.XX.XXX (yes I commented that line)
nnn.nnn.nnn.nn XXXX.XX.XXX
# the above line was added instead where nnn.nnn.nnn.nn was my static ip and XXXX.XX.XXX was my hostname

I could add a new queue without any "Unauthorized Request" errors

Thanks for the help!

From: Mike Reppert
To: Vt Vt ; Torque Users Mailing List
Sent: Tuesday, September 27, 2011 1:19 PM
Subject: Re: [torqueusers] Unauthorized Request

I ran into exactly the same issue installing on Natty. The solution for me was to modify the /etc/hosts file from the Ubuntu default. I have pasted below some notes that I made on fixing the issues in our install.

Best,
Mike

The installation instructions at http://www.clusterresources.com/torquedocs21/1.1installation.shtml worked very well, up to a few glitches (which cost several days) described below: the first with recognizing the proper host address (fixed by editing the /etc/hosts file) and the second due to problems with password-less ssh for scp file transfer from the compute nodes back to the head (fixed using ssh-keygen and setting the proper permissions).

The online installation instructions say simply to go to the torque-2.4.8 directory and type (as sudo):

$ ./configure
$ make
$ make install

I found that in addition, before running Torque (probably better before installation), the file /etc/hosts needs to be modified from the Ubuntu default. In place of the first few lines (the ipv4 part) reading

127.0.0.1 localhost
127.0.1.1
< more stuff about IPv6 >

(the lines here are the Ubuntu 11.04 default) one should add the actual (static, internal) ip addresses of both the head and compute nodes:

127.0.0.1 localhost


< same stuff about IPv6 >

For example, if your head node domain name is headnode (local static ip 192.168.1.100) and you have two compute nodes named compute-0-0 and compute-0-1, the /etc/hosts might look like

127.0.0.1 localhost
192.168.1.100 headnode
192.168.1.253 compute-0-0
192.168.1.254 compute-0-1
< same stuff about IPv6 >

This is important so that the pbs_server recognizes the head node as the actual host -- otherwise, it will be confused and try doing things like communicating with "root at localhost" instead of "root@" (i.e. "root at headnode" in our example). The problem is that both 127.0.0.1 and 127.0.1.1 are ip addresses which point to the computer itself. One needs the actual static ip of headnode in the second line or either (1) the qmgr on the head node will not recognize the head node itself or (2) the compute nodes will try communicating with "127.0.0.1" (i.e. themselves) rather than with the head node.

On Mon, Sep 26, 2011 at 3:20 PM, Vt Vt wrote:

>
>Hi,
>I have been baffled by the error "Unauthorized Request" that I keep getting while installing torque. I tried several versions including 3.0.2 and some older versions.
>
>System: Ubuntu 11.04 Natty Narwhal
>machine type : it's a single cpu (12 core machine)
>
>I am trying to use just 8 cores for the setup. My questions are:
>
>(1) how to get rid of this error?
>server logs:
>PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost
>
>
># qmgr -c 'p s '
>#
># Set server attributes.
>#
>set server acl_hosts = XXXX
>set server log_events = 511
>set server mail_from = adm
>set server scheduler_iteration = 600
>set server node_check_rate = 150
>set server tcp_timeout = 6
>
># create queue batch
>
>qmgr obj=batch svr=default: Unauthorized Request
>
>
>Qmgr: list server managers
>Server XXXX.XX.XXXX.XXX (no output)
>
>
>
>Qmgr: set server managers+=root at XXXX
>qmgr obj= svr=default: Unauthorized Request
>
>trying qterm gives the same result.
>
>
>
>(2) how do I add a host to server_manager?
>
>
>
>(3) How do I completely uninstall torque when I install torque from a tarball using default parameters?
>
>(4) has anybody got detailed info on an ubuntu 11.04 + torque install ?
>
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
>

From Gareth.Williams at csiro.au Tue Sep 27 21:41:35 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Wed, 28 Sep 2011 13:41:35 +1000
Subject: [torqueusers] torque upgrade
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE45@exvic-mbx04.nexus.csiro.au>

I have a record that there was once a page at:
http://www.clusterresources.com/torquedocs21/?id=torque:1.1_installation
which included useful upgrade information.

I think the information was more complete than:
http://www.adaptivecomputing.com/resources/docs/torque/a.eupgrade.php

Is anybody able to check?

In particular I was wondering: for a rolling upgrade, is it best to restart the pbs_server first or last or whenever?

Gareth
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20110928/6704d2ff/attachment-0001.html From atp42 at cornell.edu Wed Sep 28 07:04:04 2011 From: atp42 at cornell.edu (Aaron T Perry) Date: Wed, 28 Sep 2011 09:04:04 -0400 Subject: [torqueusers] Help: Unauthorized Request In-Reply-To: References: Message-ID: It worked on my virtual machine (well torque anyway...model still isn't cooperating). After trying some of the fixed on this list I decided to tweak my ip settings. I changed my static ip address from 192.168.133.100 to some other IP address, and then updated all settings. I think the original IP address might have been the same as my host computer, I'm not sure what problems that was causing, and what some of the other fixes I already applied had done. Thank you all for your help, Aaron On Tue, Sep 27, 2011 at 1:04 PM, Smith, Jerry Don II wrote: > $PBS_HOME/server_priv/nodes needs to encompass your compute nodes > > node1 np=4 # or however many cores you have > node2 np=4 > > Make sure that those nodes can be resolved via those names from the admin > node. > > Do you have $PBS_HOME/server_name file with the resolvable name of your > admin server? > > -Jerry > > From: Aaron T Perry > Reply-To: Torque Users Mailing List > Date: Tue, 27 Sep 2011 12:58:50 -0400 > > To: Torque Users Mailing List > Subject: Re: [torqueusers] Help: Unauthorized Request > > I think I have, I needed to create the file, and I was unsure about the > formatting required. > This is what I have there. > > # + + ubuntu atp42 > > Do I also need to create the nodes file in the torque>server_priv > directory? > > Thanks, > Aaron > > On Tue, Sep 27, 2011 at 12:40 PM, Smith, Jerry Don II wrote: > >> Have you set up hosts.equiv? >> >> see: http://www.clusterresources.com/torquedocs/1.3advconfig.shtml >> >> 1.3.2.1 Server Configuration Overview >> >> There are several steps to ensure that the server and the nodes are >> completely aware of each other and able to communicate directly. 
Some of >> this configuration takes place within TORQUE directly using the *qmgr*command. Other configuration settings are managed using the >> *pbs_server*nodes file, DNS files such as /etc/hosts and the >> /etc/hosts.equiv file. >> 1.3.2.2 Name Service Configuration >> >> Each node, as well as the server, must be able to resolve the name of >> every node with which it will interact. This can be accomplished using >> /etc/hosts, *DNS*, *NIS*, or other mechanisms. In the case of /etc/hosts, >> the file can be shared across systems in most cases. >> >> >> -Jerry >> >> From: Aaron T Perry >> Reply-To: Torque Users Mailing List >> Date: Tue, 27 Sep 2011 12:33:31 -0400 >> >> To: Torque Users Mailing List >> Subject: Re: [torqueusers] Help: Unauthorized Request >> >> With the execption of the unauthorized request entries it looks like >> almost everything is okay, execpt for the node file and root localhost >> (this should be root ubuntu. >> >> Thank you for your help! >> Aaron >> >> Here is an except from the server log... 
>> >> 09/27/2011 09:51:31;0002;PBS_Server;Svr;Log;Log opened >> 09/27/2011 09:51:31;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, >> initialization type = 4 >> 09/27/2011 09:51:42;0002;PBS_Server;Svr;Log;Log opened >> 09/27/2011 09:51:42;0006;PBS_Server;Svr;PBS_Server;Server ubuntu started, >> initialization type = 4 >> 09/27/2011 09:51:44;0002;PBS_Server;Svr;Act;Account file >> /var/spool/torque/server_priv/accounting/20110927 opened >> 09/27/2011 09:51:44;0040;PBS_Server;Req;setup_nodes;setup_nodes() >> 09/27/2011 09:51:44;0004;PBS_Server;Svr;ubuntu;cannot open node >> description file '/var/spool/torque/server_priv/nodes' in setup_nodes() >> 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 >> queues >> 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 >> jobs >> 09/27/2011 09:51:44;0006;PBS_Server;Svr;PBS_Server;Using ports >> Server:15001 Scheduler:15004 MOM:15002 (server: 'ubuntu') >> 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent >> is exiting >> 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: parent >> is exiting >> 09/27/2011 09:51:44;0002;PBS_Server;Svr;daemonize_server;INFO: child >> process in background >> 09/27/2011 09:51:44;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = >> 11995, loglevel=0 >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> 
code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:44;0080;PBS_Server;Req;req_reject;Reject reply >> code=15007(Unauthorized Request ), aux=0, type=Manager, from root at localhost >> 09/27/2011 09:51:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = >> 3.0.2, loglevel = 0 >> 09/27/2011 09:56:49;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = >> 3.0.2, loglevel = 0 >> ... >> >> On Tue, Sep 27, 2011 at 12:13 PM, Smith, Jerry Don II wrote: >> >>> Are you seeing anything in the pbs_server logs? >>> >>> -Jerry >>> >>> From: Aaron T Perry >>> Reply-To: Torque Users Mailing List >>> Date: Tue, 27 Sep 2011 09:53:28 -0400 >>> To: >>> Subject: Re: [torqueusers] Help: Unauthorized Request >>> >>> Please, any help you can give would be greatly appreciated, I'm >>> completely stuck. All the solutions I found online have failed. 
>>>
>>> On Mon, Sep 26, 2011 at 2:35 PM, Aaron wrote:
>>>
>>>> Hi,
>>>>
>>>> I've just tried to install torque, and I ran the following commands,
>>>>
>>>> ./configure
>>>> sudo make
>>>> sudo make install
>>>>
>>>> however when I run ./torque.setup username I get the following...
>>>>
>>>> initializing TORQUE (admin: username at ubuntu)
>>>> PBS_Server ubuntu: Create mode and server database exists,
>>>> do you wish to continue y/(n)?y
>>>> Max open servers: 9
>>>> qmgr obj= svr=default: Unauthorized Request
>>>> Max open servers: 9
>>>> qmgr obj= svr=default: Unauthorized Request
>>>> qmgr obj= svr=default: Unauthorized Request
>>>> qmgr obj= svr=default: Unauthorized Request
>>>> qmgr obj= svr=default: Unauthorized Request
>>>> qmgr obj=batch svr=default: Unauthorized Request
>>>> qmgr obj=batch svr=default: Unauthorized Request
>>>> qmgr obj=batch svr=default: Unauthorized Request
>>>> qmgr obj=batch svr=default: Unauthorized Request
>>>> qmgr obj=batch svr=default: Unauthorized Request
>>>> qmgr obj=batch svr=default: Unauthorized Request
>>>> qmgr obj= svr=default: Unauthorized Request
>>>>
>>>> The server launched and I cannot stop it, nor can I issue any command
>>>> related to torque (qterm, qmgr, qsub, etc) under my current username or
>>>> under root. Help!
>>>>
>>>> Thank you,
>>>> -Aaron

From d-gitelman at northwestern.edu Wed Sep 28 07:54:08 2011
From: d-gitelman at northwestern.edu (Darren R Gitelman)
Date: Wed, 28 Sep 2011 08:54:08 -0500 (CDT)
Subject: [torqueusers] using dependencies and arrays
Message-ID: <6608.166.137.140.47.1317218048.squirrel@merle.it.northwestern.edu>

I am having a problem with dependencies and job arrays. I've seen several messages on the list about this but no resolution. We are using torque 2.5.8

I first submit several jobs in an array:

qsub -l -t 1-3

This returns jobid1[]

If I call the job with

qsub -l depend:afteranyarray:jobid1[]

Then the second job doesn't wait until the job array (jobid1[]) has completed.
It starts up about 20 seconds after the various jobs in the job array are started and of course fails since the results of jobid1 aren't ready yet. I've also tried using depend:afterokarray, depend:afterok and depend:afterany.

I've also tried submitting the second job with:

qsub -W depend:afteranyarray:jobid1[]

(as well as the same permutations as above). In this case the second job does hold... forever. When I run checkjob on each job in the array I find they have all completed successfully with an exit status of 0.

When I checkjob the held job I get

[xxx at quser04 ~]$ checkjob -vvv 1183767
job 1183767 (RM job '1183767.qsched01')

AName: xxx.defragment
State: Hold
Creds:  user:xxx  group:xxx  account:t20213  class:short
WallTime:   00:00:00 of 3:58:20
SubmitTime: Tue Sep 27 11:33:33
  (Time Queued  Total: 1:45:28  Eligible: 00:00:05)

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1

Req[0]  TaskCount: 1  Partition: ALL  NodeCount:  1

IWD:            /home/xxx
UMask:          0000
OutputFile:     quser04:/home/xxx/./xxx_logs/xxx.defragment.o1183767
ErrorFile:      quser04:/home/xxx/./xxx_logs/xxx.defragment.e1183767
Partition List: quest1,quest2,questgpu1,SHARED
SrcRM:          torque  DstRM: torque  DstRMJID: 1183767.qsched01
Submit Args:    -V -d . -r y -q short -M d-xxx at xxx.edu -N xxx.defragment -m abe -o ./xxx_logs/ -e ./xxx_logs/ -l walltime=14300 -W depend=afteranyarray:1183766[] /home/xxx/tempcmd20332
Flags:          RESTARTABLE
Attr:           checkpoint
StartPriority:  256
PE:             1.00

NOTE:  job cannot run  (job has hold in place)
NOTE:  job violates constraints for partition hyperthread (non-idle state 'Hold')
NOTE:  job violates constraints for partition quest1 (non-idle state 'Hold')
NOTE:  job violates constraints for partition quest2 (non-idle state 'Hold')
NOTE:  job violates constraints for partition questgpu1 (non-idle state 'Hold')
NOTE:  job violates constraints for partition pim (non-idle state 'Hold')

BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)

I don't really understand what constraints the job is violating and why the dependency isn't working with either -l or -W.

Thanks
Darren

From andre.gemuend at scai.fraunhofer.de Thu Sep 29 02:09:36 2011
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Thu, 29 Sep 2011 10:09:36 +0200 (CEST)
Subject: [torqueusers] File staging syntax
In-Reply-To:
Message-ID: <1e5b02c0-6a48-4bde-9247-964ca25edfea@zimbra.scai.fraunhofer.de>

Did you ever find time to reproduce this? It would be nice to know the exact version this changed in, for the bug report on the related software.

Greetings
André

----- Original Message -----
> Hello Ken,
>
> you just need two stagein or stageout files in one line:
>
> [andre at gloria pbs]$ cat pbstest
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -q local
> #PBS -W stagein=foo@gloria.d-grid.scai.fraunhofer.de:/tmp/foo,stagein=foo2@gloria.d-grid.scai.fraunhofer.de:/tmp/foo2
> #PBS -m n
> echo "foo"
> [andre at gloria pbs]$ qsub pbstest
> qsub: illegal -W value
>
> [andre at gloria pbs]$ cat pbstest
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -q local
> #PBS -W stagein=foo@gloria.d-grid.scai.fraunhofer.de:/tmp/foo,foo2@gloria.d-grid.scai.fraunhofer.de:/tmp/foo2
> #PBS -m n
> echo "foo"
> [andre at gloria pbs]$ qsub pbstest
> qsub: illegal -W value
>
> [andre at gloria pbs]$ cat pbstest
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -q local
> #PBS -W stagein=foo@gloria.d-grid.scai.fraunhofer.de:/tmp/foo
> #PBS -W stagein=foo2@gloria.d-grid.scai.fraunhofer.de:/tmp/foo2
> #PBS -m n
> echo "foo"
> [andre at gloria pbs]$ qsub pbstest
> 435688.tonia.d-grid.scai.fraunhofer.de
>
> So neither the old, nor the new syntax of specifying multiple files
> per -W works anymore.
>
> gLite CREAM (http://glite.cern.ch/glite-CREAM/) generates these lines
> with its wrapper script.
> So this is basically a bug in that software
> (which is easy to solve), but it would have been nice to be notified
> of the change.
>
> Greetings
> André
>
> > Andre
> >
> > Can you send your qsub or msub line?
> >
> > Can you send your script as well?
>
> > > is it possible that the -W syntax changed again between 2.5.5 and
> > > 2.5.8? We were using 2.5.5 without problems, but since I upgraded to
> > > 2.5.8 yesterday, PBS scripts with more than one file per staging
> > > line failed with "illegal -W syntax". I had to change the scripts to
> > > use separate -W lines for every file. I didn't see this in the
> > > changelog, or maybe I just missed it?
>
> --
> André Gemünd
> Fraunhofer-Institute for Algorithms and Scientific Computing
> andre.gemuend at scai.fraunhofer.de
> Tel: +49 2241 14-2193
> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From l.flis at cyf-kr.edu.pl Thu Sep 29 04:30:41 2011
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Thu, 29 Sep 2011 12:30:41 +0200
Subject: [torqueusers] File staging syntax
In-Reply-To: <1e5b02c0-6a48-4bde-9247-964ca25edfea@zimbra.scai.fraunhofer.de>
References: <1e5b02c0-6a48-4bde-9247-964ca25edfea@zimbra.scai.fraunhofer.de>
Message-ID: <4E8448D1.9010700@cyf-kr.edu.pl>

Hello,

We hit this issue in Cyfronet after migrating our grid clusters to Torque 2.5.8.
File staging in CREAM stopped working so we had to patch things a bit on the CREAM side.
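
For anyone generating job scripts themselves, the per-file form that was reported earlier in this thread to be accepted by 2.5.8 (one -W stagein= line per file) can be emitted mechanically. The following is only a minimal Python sketch: stagein_directives and its (local, host, remote) tuple layout are made-up names for illustration, not part of TORQUE or CREAM, and the local@host:path shape is taken from the examples in this thread (the list archive shows "@" as " at " in places).

```python
def stagein_directives(files):
    """Build one '#PBS -W stagein=' line per (local, host, remote) tuple."""
    lines = []
    for local, host, remote in files:
        for part in (local, host, remote):
            # A comma would start a second file spec on the same line,
            # which is exactly the form this sketch is trying to avoid.
            if not part or "," in part or "\n" in part:
                raise ValueError("unsupported staging component: %r" % (part,))
        lines.append("#PBS -W stagein=%s@%s:%s" % (local, host, remote))
    return lines


if __name__ == "__main__":
    # Host and paths copied from the test script quoted in this thread.
    host = "gloria.d-grid.scai.fraunhofer.de"
    for line in stagein_directives([("foo", host, "/tmp/foo"),
                                    ("foo2", host, "/tmp/foo2")]):
        print(line)
```

Whether a given TORQUE release also accepts a combined, quoted form is a separate question; this sketch deliberately sidesteps it by never putting two files on one line.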

Stagein syntax accepted by 2.5.8 requires additional escaped inverted commas (single quotes):

Example:
#PBS -W stagein=\'foo@gloria.d-grid.scai.fraunhofer.de:/tmp/foo,foo2@gloria.d-grid.scai.fraunhofer.de:/tmp/foo2\'

As a workaround please specify:
PBS_MULTIPLE_STAGING_DIRECTIVE=no
in site-info.def and apply the attached patch to /opt/glite/bin/pbs_submit.sh

I hope that helps

Best Regards
--
Lukasz Flis
ACC Cyfronet AGH

> Did you ever find time to reproduce this? It would be nice to know the exact version this changed in, for the bug report on the related software.
>
> Greetings
> André

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pbs_submit.patch
Type: text/x-patch
Size: 990 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20110929/7e7d1ae6/attachment.bin

From andre.gemuend at scai.fraunhofer.de Thu Sep 29 05:50:57 2011
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Thu, 29 Sep 2011 13:50:57 +0200 (CEST)
Subject: [torqueusers] File staging syntax
In-Reply-To: <4E8448D1.9010700@cyf-kr.edu.pl>
Message-ID: <89241071-54a7-43a0-9187-e43ab2bbaeb3@zimbra.scai.fraunhofer.de>

Hey Lukasz,

I already patched our local version of pbs_submit.sh (though differently than you). That was not my point. I just didn't see mention of this in the changelog.

Greetings
André

----- Original Message -----
> Hello,
>
> We hit this issue in Cyfronet after migrating our grid clusters to
> Torque 2.5.8.
> File staging in CREAM stopped working so we had to patch things a bit on
> the CREAM side.
>
> Stagein syntax accepted by 2.5.8 requires additional escaped inverted
> commas:
>
> Example:
> #PBS -W stagein=\'foo@gloria.d-grid.scai.fraunhofer.de:/tmp/foo,foo2@gloria.d-grid.scai.fraunhofer.de:/tmp/foo2\'
>
> As a workaround please specify:
> PBS_MULTIPLE_STAGING_DIRECTIVE=no
> in site-info.def and apply attached patch to
> /opt/glite/bin/pbs_submit.sh
>
> I hope that helps
>
> Best Regards
> --
> Lukasz Flis
> ACC Cyfronet AGH

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From knielson at adaptivecomputing.com Thu Sep 29 08:59:02 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 29 Sep 2011 08:59:02 -0600 (MDT)
Subject: [torqueusers] File staging syntax
In-Reply-To: <1e5b02c0-6a48-4bde-9247-964ca25edfea@zimbra.scai.fraunhofer.de>
Message-ID: <09eea517-55d2-41ed-a7a3-6f73587943db@mail>

----- Original Message -----
> From: "André Gemünd"
> To: "Torque Users Mailing List"
> Sent: Thursday, September 29, 2011 2:09:36 AM
> Subject: Re: [torqueusers] File staging syntax
>
> Did you ever find time to reproduce this? It would be nice to know
> the exact version this changed in, for the bug report on the related
> software.
>
> Greetings
> André
>

André,

I have not yet had time to reproduce this. I did look through the change log and there are two suspects. One is in 2.5.6, a fix for Bugzilla 115, and the other is in 2.5.8, a fix for Bugzilla 133.

That is as far as I am right now. I will try to get to this as soon as I can.

Ken

From wannes.van.causbroeck at imdc.be Thu Sep 29 06:39:11 2011
From: wannes.van.causbroeck at imdc.be (Wannes Van Causbroeck)
Date: Thu, 29 Sep 2011 14:39:11 +0200
Subject: [torqueusers] numa problems
Message-ID: <9EBFECC459B64D448BC1AF3A7CFAB40588CE80@imdc-mail.imdc.local>

Hello everyone!
I sent this message before, but i don't know if it arrived correctly, so i'll try again. (sorry if this is a dupe)

we're just starting out with torque, but we've run into a problem. We have a 48-core AMD system (4 sockets with 12 cores each). The linux system sees this as 8 nodes with 6 cores each.
I've tried compiling torque 3.02 with --enable-cpuset and --enable-numa-support. (i also tried without cpuset, but the result was the same, i even got an error telling me i had to mount /dev/cpuset, even without this switch???).
Anyway, our mom.layout looks like this:

cpus=0,4,8,12,16,20 mem=0
cpus=24,28,32,36,40,44 mem=1
cpus=1,5,9,13,17,21 mem=2
cpus=25,29,33,37,31,45 mem=3
cpus=2,6,10,14,18,22 mem=4
cpus=26,30,34,38,42,46 mem=5
cpus=3,7,11,15,19,23 mem=6
cpus=27,31,35,39,43,47 mem=7

it's a bit strange, but this is how it's reported by linux.
When i start a job with these parameters:

#PBS -N JobMPI
#PBS -l nodes=1:ppn=4
#PBS -m abe

It starts 4 processes in a really weird way. Sometimes he uses core 0,1,2,3, sometimes 2 processes get run on one core, then it jumps to core 24, etc.
the system takes a big performance hit when the processes aren't run on the cores sharing the same memory, so we want to lock the tasks on the same node.

What am i doing wrong?

Greetings,
Wannes

From samuel at unimelb.edu.au Thu Sep 29 17:18:08 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 30 Sep 2011 09:18:08 +1000
Subject: [torqueusers] numa problems
In-Reply-To: <9EBFECC459B64D448BC1AF3A7CFAB40588CE80@imdc-mail.imdc.local>
References: <9EBFECC459B64D448BC1AF3A7CFAB40588CE80@imdc-mail.imdc.local>
Message-ID: <4E84FCB0.5040002@unimelb.edu.au>

On 29/09/11 22:39, Wannes Van Causbroeck wrote:

> it's a bit strange, but this is how it's reported by linux.

Not directly relevant, but I do find that the utils/lstopo command which comes as part of hwloc from the Open-MPI project is extremely useful for visualising the NUMA architecture of a system (and the 1.3 pre-releases include libpci support so you can see where PCI slots, NICs, disks etc hang off too).

http://www.open-mpi.org/projects/hwloc/

It can do text dumps, PNGs, XML, etc. If you've got it compiled with X11 support then it'll throw up a window showing the graphical layout.
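
Where hwloc is not installed, the same node-to-CPU mapping can usually be read straight from Linux sysfs. What follows is a rough Python sketch under stated assumptions: it assumes the kernel exposes /sys/devices/system/node/nodeN/cpulist, the names parse_cpulist and layout_lines are invented here, and the output merely mirrors the "cpus=... mem=N" shape of the mom.layout quoted in this thread rather than any documented TORQUE grammar.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a sysfs cpulist such as '0-2,8' into [0, 1, 2, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def layout_lines(sysfs_root="/sys/devices/system/node"):
    """Return one 'cpus=... mem=N' line per NUMA node that has CPUs."""
    def node_index(path):
        return int(re.search(r"(\d+)$", path).group(1))

    lines = []
    node_dirs = glob.glob(os.path.join(sysfs_root, "node[0-9]*"))
    for node_dir in sorted(node_dirs, key=node_index):
        with open(os.path.join(node_dir, "cpulist")) as handle:
            cpus = parse_cpulist(handle.read())
        if cpus:  # skip memory-only nodes
            lines.append("cpus=%s mem=%d"
                         % (",".join(str(c) for c in cpus), node_index(node_dir)))
    return lines


if __name__ == "__main__":
    print("\n".join(layout_lines()))
```

Comparing this output with a hand-written mom.layout is a cheap way to catch transcription slips before blaming the scheduler.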

cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

From knielson at adaptivecomputing.com Thu Sep 29 17:27:15 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 29 Sep 2011 17:27:15 -0600 (MDT)
Subject: [torqueusers] Job arrays don't show up in pbstop output
In-Reply-To: <601D8486-C76B-44DF-8959-E23AFEB3F2BF@nyu.edu>
Message-ID: <8adb285b-ad9c-4fe3-aeca-83d77f290487@mail>

----- Original Message -----
> From: "Sreedhar Manchu"
> To: "Torque Users Mailing List"
> Sent: Tuesday, September 27, 2011 9:32:54 AM
> Subject: [torqueusers] Job arrays don't show up in pbstop output
>
> Hi,
>
> Since we upgraded to torque 2.5.8, job arrays are not showing up in
> pbstop output. I don't have that much expertise in perl to change
> the pbstop code. If anyone has fixed this problem or could tell me
> pointers in fixing it, I would greatly appreciate it.
>
> I have looked for latest version of pbstop but it looks like I do
> have the latest one (pbstop-4.16-10.el5 and perl-PBS-0.33-10.el5).
>
> Thanks in advance,
> Sreedhar.

Hi all,

I e-mailed Garrick and he sent me the latest code he had. He said he did not update it to include the new job arrays. Someone with perl expertise may want to do that.

Garrick suggested we take this over. I suggest we add it to the contrib directory and make it part of the distribution.

Regards

Ken
-------------- next part --------------
A non-text attachment was scrubbed...
Name: perl-PBS-0.33.tar.gz
Type: application/x-compressed-tar
Size: 90264 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20110929/a6dffb56/attachment-0001.bin

From Gareth.Williams at csiro.au Thu Sep 29 18:00:15 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 30 Sep 2011 10:00:15 +1000
Subject: [torqueusers] Job arrays don't show up in pbstop output
In-Reply-To: <8adb285b-ad9c-4fe3-aeca-83d77f290487@mail>
References: <601D8486-C76B-44DF-8959-E23AFEB3F2BF@nyu.edu> <8adb285b-ad9c-4fe3-aeca-83d77f290487@mail>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE5D@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Ken Nielson [mailto:knielson at adaptivecomputing.com]
> Sent: Friday, 30 September 2011 9:27 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Job arrays don't show up in pbstop output
>
> Hi all,
>
> I e-mailed Garrick and he sent me the latest code he had. He said he
> did not update it to include the new job arrays. Someone with perl
> expertise may want to do that.
>
> Garrick suggested we take this over. I suggest we add it to the contrib
> directory and make it part of the distribution.
>
> Regards
>
> Ken

Fantastic.

Ken, I'll send you our array-job-capable pbstop version (just the script - we did not need to change perl-PBS). It has some minor site-specific info, and the copyright info should really be updated to reflect that it is a derivative work (under the licensing conditions specified in the pbstop script) and that not all the copyright lies with USC (perhaps the header could do with a contributors section). Also, changing the contact information and the information on where to get updates would probably be best.

That said, I don't really mind if you post our version as-is. I already provided it to Sreedhar.

Cheers,

Gareth

From dbeer at adaptivecomputing.com Fri Sep 30 09:14:45 2011
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 30 Sep 2011 09:14:45 -0600 (MDT)
Subject: [torqueusers] numa problems
In-Reply-To: <9EBFECC459B64D448BC1AF3A7CFAB40588CE80@imdc-mail.imdc.local>
Message-ID: <6fe1606a-e98b-4d30-9093-8e2d1d0a3bad@mail>

----- Original Message -----
> Hello everyone!
> I sent this message before, but i don't know if it arrived correctly,
> so i'll try again. (sorry if this is a dupe)
>
>
> we're just starting out with torque, but we've run into a problem. We
> have a 48-core AMD system (4 sockets with 12 cores each). The linux
> system sees this as 8 nodes with 6 cores each.
> I've tried compiling torque 3.02 with --enable-cpuset and
> --enable-numa-support. (i also tried without cpuset, but the result was
> the same, i even got an error telling me i had to mount /dev/cpuset,
> even without this switch???).

Numa support uses cpusets for its implementation, so yes, you'll get the same result whether or not you use the --enable-cpuset switch. You will definitely need to mount cpusets in order to get things working.
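
The mom.layout posted in this thread can be sanity-checked mechanically before any deeper debugging. Below is a small Python sketch (check_layout is a made-up helper, and the "cpus=... mem=N" line format is assumed from the post itself) that reports CPU ids listed more than once and ids never listed. On the layout exactly as typed in the message, CPU 31 appears in two lines and CPU 41 in none, which may of course be only a typo in the e-mail rather than in the real file.

```python
def check_layout(text, ncpus):
    """Return (duplicates, missing) for a mom.layout-style text.

    duplicates maps a CPU id to the 1-based line numbers listing it;
    missing is the sorted list of ids in range(ncpus) never listed.
    """
    seen = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        for field in line.split():
            if field.startswith("cpus="):
                for cpu in field[len("cpus="):].split(","):
                    seen.setdefault(int(cpu), []).append(lineno)
    duplicates = {cpu: lines for cpu, lines in seen.items() if len(lines) > 1}
    missing = sorted(set(range(ncpus)) - set(seen))
    return duplicates, missing


# Layout copied verbatim from the message in this thread.
layout = """\
cpus=0,4,8,12,16,20 mem=0
cpus=24,28,32,36,40,44 mem=1
cpus=1,5,9,13,17,21 mem=2
cpus=25,29,33,37,31,45 mem=3
cpus=2,6,10,14,18,22 mem=4
cpus=26,30,34,38,42,46 mem=5
cpus=3,7,11,15,19,23 mem=6
cpus=27,31,35,39,43,47 mem=7
"""

dups, missing = check_layout(layout, 48)
print(dups, missing)  # prints: {31: [4, 8]} [41]
```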
> Anyway, our mom.layout looks like this:
>
> cpus=0,4,8,12,16,20 mem=0
> cpus=24,28,32,36,40,44 mem=1
> cpus=1,5,9,13,17,21 mem=2
> cpus=25,29,33,37,31,45 mem=3
> cpus=2,6,10,14,18,22 mem=4
> cpus=26,30,34,38,42,46 mem=5
> cpus=3,7,11,15,19,23 mem=6
> cpus=27,31,35,39,43,47 mem=7
>
> it's a bit strange, but this is how it's reported by linux.
> When i start a job with these parameters:
>
> #PBS -N JobMPI
> #PBS -l nodes=1:ppn=4
> #PBS -m abe
>
> It starts 4 processes in a really weird way. Sometimes he uses core
> 0,1,2,3, sometimes 2 processes get run on one core, then it jumps to
> core 24, etc.
> the system takes a big performance hit when the processes aren't run on
> the cores sharing the same memory, so we want to lock the tasks on the
> same node.
>
> What am i doing wrong?

I second Chris's suggestion - please send in the output of lstopo and we'll see what to do from there. I do wonder about your ordering - I'm not sure that TORQUE 3.0.* is well-equipped to handle a system with that kind of layout, but send in your lstopo output and we'll help you as much as we can.

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1656 S. East Bay Blvd. Suite #300
Provo, UT 84606

From siegert at sfu.ca Fri Sep 30 12:43:53 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Fri, 30 Sep 2011 11:43:53 -0700
Subject: [torqueusers] limit the number of jobs a user can submit
Message-ID: <20110930184353.GF21971@stikine.sfu.ca>

Hi,

I know this has been discussed before, but I believe an important aspect has been overlooked:

Moab has a limit on the number of jobs it can handle: the MAXJOB parameter:

"Specifies the maximum number of simultaneous jobs which can be evaluated by the scheduler. If additional jobs are submitted to the resource manager, Moab will ignore these jobs until previously submitted jobs complete."

This allows for a trivial denial-of-service attack: Simply submit a job array with at least MAXJOB+1 elements.
After that moab will disregard all further jobs for scheduling even if they have a much higher priority than the array job elements.

I have not yet found a way of preventing this DoS attack. The most logical solution to me would be to expand the "max_user_queuable" specification to allow for a server wide setting, not just a per queue setting, i.e.,

set server max_user_queuable = 1000

Is that a feasible solution? (and, yes, I'd like this limit to be in torque and not in moab because the user will get an immediate response from qsub).

Cheers,
Martin

--
Martin Siegert
Simon Fraser University