From jwbacon at tds.net Thu Dec 1 07:35:02 2011
From: jwbacon at tds.net (Jason Bacon)
Date: Thu, 01 Dec 2011 08:35:02 -0600
Subject: [torqueusers] Cannot connect to default server host 'node32' - check pbs_server daemon.
In-Reply-To: <4ED722D6.10404@yahoo.com.cn>
References: <4ED722D6.10404@yahoo.com.cn>
Message-ID: <4ED79096.6070900@tds.net>

Ping only tells you that the OS has a working network connection.
qstat requires PBS daemons listening on that connection, so try
restarting the PBS daemons that are configured to run on that node.

Regards,

-J

On 12/01/11 00:46, Hongsheng Zhao wrote:
> Hi all,
>
> When I issue the qstat -a command on my hpc, I meet the following errors:
>
> ----------
> zhaohongsheng at node32:~> qstat -a
> Cannot connect to default server host 'node32' - check pbs_server daemon.
> qstat: cannot connect to server node32 (errno=111) Connection refused
>
> But the ping echo reply from the node32 is as follows:
>
> zhaohongsheng at node32:~> ping node32
> PING node32.nxu.edu.cn (202.201.128.36) 56(84) bytes of data.
> 64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=1 ttl=64 time=0.016 ms
> 64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=2 ttl=64 time=0.010 ms
> 64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=3 ttl=64 time=0.010 ms
> 64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=4 ttl=64 time=0.015 ms
> 64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=5 ttl=64 time=0.012 ms
> 64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=6 ttl=64 time=0.013 ms
>
> --- node32.nxu.edu.cn ping statistics ---
> 6 packets transmitted, 6 received, 0% packet loss, time 5000ms
> rtt min/avg/max/mdev = 0.010/0.012/0.016/0.004 ms
> zhaohongsheng at node32:~>
> ----------
>
> Any hints on this issue?
>
> Regards
> --
> Hongsheng Zhao
> School of Physics and Electrical Information Science,
> Ningxia University, Yinchuan 750021, China
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From glen.beane at gmail.com Thu Dec 1 07:41:18 2011
From: glen.beane at gmail.com (Glen Beane)
Date: Thu, 1 Dec 2011 09:41:18 -0500
Subject: [torqueusers] specific nodes
In-Reply-To:
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu>
Message-ID:

On Wed, Nov 30, 2011 at 5:38 PM, Ricardo Román Brenes wrote:
> Thank you so much for your help =) yet I still have matters to discuss.
>
>
> On Wed, Nov 30, 2011 at 4:22 PM, Gustavo Correa
> wrote:
>>
>> You don't have 8 CPUs of type 'uno'.
>> This seems to conflict with your mpirun command with -np=8.
>> You need to match the number of processors you request from Torque and
>> the number of processes you launch with mpirun.
>>
>
>
> 1. Why does there have to be a match between processors and processes? I could
> run 1024 processes on 1 processor (without torque). Requesting 2 nodes I could
> spawn 10000 processes...

You can certainly oversubscribe, but that makes the scheduler's and
resource manager's jobs harder, because you are essentially "lying" to
them by using more cores than you requested. (If you are scheduling the
entire node it doesn't matter so much, but on my cluster we have 32-core
nodes and it is fairly common to have a node running jobs from more than
one user.) You can get kicked off a shared cluster for doing this.

>
>>
>> Also, you wrote:
>>
>> #PPS -q uno
>>
>> Is this a typo in your email or in your Torque submission script?
>> It should be:
>>
>> #PBS -q uno
>>
>> In addition, your PBS script doesn't request nodes, something like
>> #PBS -l nodes=1:ppn=2
>> I suppose it will use the default for the queue uno.
>> However, your qmgr configuration doesn't set a default number of nodes to
>> use,
>> either for the queues or for the server itself.
>>
>> You could do:
>> qmgr -c 'set queue uno resources_default.nodes = 1'
>> and likewise for queue dos.
>>

You should request a number of nodes and processors; otherwise you will
get a default (for the queue, if one is set). If there is no default
set, I think TORQUE and Maui will assume you want one core.

>
>>
>> More important, is your mpi [and mpiexec] built with Torque support?
>> For instance, OpenMPI can be built with Torque support, so that it
>> will use the nodes provided by Torque to run the job.
>> However, stock packaged MPIs from yum or apt-get are probably not
>> integrated with Torque.
>> You would need to build it from source, which is not really hard.
>>
>> If you use an mpi that is not integrated with Torque, you need to pass to
>> mpirun/mpiexec
>> the file created by Torque with the node list.
>> The file name is held by the environment variable $PBS_NODEFILE.
>> The syntax varies depending on which mpi you are using, check your mpirun
>> man page,
>> but should be something like:
>>
>> mpirun -hostfile $PBS_NODEFILE -np 2 ./a.out
>>
>
> 3. My MPICH2 is version 1.2.1p1. I don't recall if I compiled it with torque
> support. Even so I don't have a variable $PBS_NODEFILE. (doing a
> "echo $PBS_NODEFILE" returns an empty line).

PBS_NODEFILE will only be defined in a batch job. Submit a simple job
to TORQUE that only cats $PBS_NODEFILE and see what it says.

If you don't compile MPICH2 with TORQUE/PBS support then you will need
to pass mpiexec a hostfile. Otherwise it might use a default hostfile
if you set one (I don't remember, I haven't used MPICH in years).
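
The "simple job that only cats $PBS_NODEFILE" suggested above could look
roughly like the sketch below. The job name, the resource request and the
helper function show_nodefile are assumptions made for illustration; only
the PBS_NODEFILE variable itself comes from the thread. Outside a batch job
the script just reports that the variable is unset.

```shell
#!/bin/sh
#PBS -N nodefile-check
#PBS -l nodes=1:ppn=2
#PBS -j oe

# Hypothetical helper: summarise the node file that TORQUE creates for a job.
show_nodefile() {
    if [ -z "${PBS_NODEFILE:-}" ]; then
        # The variable is only defined inside a batch job.
        echo "PBS_NODEFILE is not set (not inside a batch job?)"
        return 0
    fi
    echo "Node file: $PBS_NODEFILE"
    # The file has one line per allocated slot, so count the repeats per host.
    sort "$PBS_NODEFILE" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}

show_nodefile
```

Saved under a name of your choice (say nodefile-check.sh) and submitted with
qsub, the job's output file should then list each allocated host with its
slot count.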
If your MPI implementation has TORQUE integration then you don't pass a
hostfile, since it gets this information directly from TORQUE.

I would HIGHLY HIGHLY recommend using an MPI process launcher that is
integrated with TORQUE. For MPICH2 you can use OSC's drop-in
replacement mpiexec, and I think MPICH2 might also offer native support
(maybe a compile-time option) but I'm not 100% sure. The benefits of
this are: 1) mpirun/mpiexec can get the host list directly from TORQUE,
and 2) it uses the TM API to launch the remote processes rather than
something like ssh. This means every process is under the control of
TORQUE: no zombie processes, and things like the CPU time measured for
the job are correct.

>
> 4. I don't know if this is my problem or not but you talk about mpirun and
> mpiexec as if they were the same, yet I have used mpiexec most of the time
> and I'm not sure about the similarities (or differences). You asked if my
> MPIEXEC is built with torque but a few lines below you mention MPIRUN

Older MPI implementations called the job launcher mpirun. Later the MPI
standard specified that the MPI job launcher be called mpiexec. Some
MPI implementations provide both for compatibility (they might actually
be the same command), but in this thread you can assume they are the
same thing. Just use whichever is appropriate for your installation.

From WJEdsall at dow.com Thu Dec 1 10:46:58 2011
From: WJEdsall at dow.com (Edsall, William (WJ))
Date: Thu, 1 Dec 2011 12:46:58 -0500
Subject: [torqueusers] failover resources
Message-ID: <52CD990A674498429E6A7B4FCAE3F7D307C9F0A6@USMDLMDOWX025.dow.com>

Hello,
Is it possible to use the resource_list to request a secondary resource, if the first resource is not available?
For example
resource 1: -l nodes=1:infini:ppn=8
resource 2: -l nodes=1:infini2:ppn=8

I've tried the following and it defaults to the second resource:
-l nodes=1:infini:ppn=8,nodes=1:infini2:ppn=8

Thanks in advance
Will

From samuel at unimelb.edu.au Thu Dec 1 17:32:01 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 02 Dec 2011 11:32:01 +1100
Subject: [torqueusers] Cannot connect to default server host 'node32' - check pbs_server daemon.
In-Reply-To: <4ED722D6.10404@yahoo.com.cn>
References: <4ED722D6.10404@yahoo.com.cn>
Message-ID: <4ED81C81.1010207@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/12/11 17:46, Hongsheng Zhao wrote:

> Any hints on this issue?

Is node32 really where your pbs_server is running ?

If not, does /var/spool/torque/server_name contain the
correct name for the host running pbs_server ?

cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7YHIEACgkQO2KABBYQAh9ixACgkFJyY3ZJtAs9Ne8H9iifnDtd
buQAoIhsL8usCRi/wvw8SwScr3ecaUbN
=0rSg
-----END PGP SIGNATURE-----

From zhaohscas at yahoo.com.cn Thu Dec 1 20:53:18 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Fri, 02 Dec 2011 11:53:18 +0800
Subject: [torqueusers] Cannot connect to default server host 'node32' - check pbs_server daemon.
In-Reply-To: <4ED81C81.1010207@unimelb.edu.au>
References: <4ED722D6.10404@yahoo.com.cn> <4ED81C81.1010207@unimelb.edu.au>
Message-ID: <4ED84BAE.5090803@yahoo.com.cn>

On 12/02/2011 08:32 AM, Christopher Samuel wrote:
> Is node32 really where your pbs_server is running ?

Thanks for your hints. But what's the command used to determine the
node on which the pbs_server is running?

>
> If not, does /var/spool/torque/server_name contain the
> correct name for the host running pbs_server ?

In my case, I use a customized version of torque by Drawing Company,
i.e., named as gridview. I've found that the following file should be
the equivalent of the one you mentioned above:
/opt/gridview/pbs/dispatcher/server_name. You can see its contents
with the following command:

-----
zhaohongsheng at node32:~> cat /opt/gridview/pbs/dispatcher/server_name
node32
-------

Finally, I've solved the issue by restarting the pbs_server daemon as
follows:

-------------
node32:~ #
node32:~ # /sbin/service pbs_server restart
Shutting down dispatcher Server: action OK
Starting dispatcher Server: action OK

or it can be done as follows:

node32:~ # service pbs_server restart
Shutting down dispatcher Server: action OK
Starting dispatcher Server: action OK
-------------

But I can't figure out the difference between the above two commands.

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Thu Dec 1 22:16:55 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Fri, 02 Dec 2011 13:16:55 +0800
Subject: [torqueusers] My issue when changing the nodes list for a queue.
Message-ID: <4ED85F47.7090005@yahoo.com.cn>

Hi all,

I meet the following issue:

On my hpc, I have a queue named default. Now, I've made some adjustments to the node list of this queue.
The new node definition file for this queue is as follows:

----------------
node32:/opt/gridview/pbs/dispatcher/server_priv/acl_hosts # cat default
node29
node28
node27
node26
node25
node24
node23
node33
node22
node21
node30
------------------

After the above modification, I've found I must restart the pbs_server
for the changes to take effect, i.e., by running the following
command:

node32:/opt/gridview/pbs/dispatcher/server_priv/acl_hosts # service pbs_server restart

But this will have an influence on all of the running jobs on this hpc.
Now, I want to know whether it is possible to let my changes take
effect immediately without restarting the pbs_server?

Any hints will be highly appreciated.

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Thu Dec 1 22:26:26 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Fri, 02 Dec 2011 13:26:26 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
Message-ID: <4ED86182.2000202@yahoo.com.cn>

Hi all,

In my pbs script, I use the following code to generate the PBS log
files:

--------
# Set filenames for PBS to log standard error and standard output.
#PBS -o stdout
#PBS -e stderr
...
#Start job from the directory it was submitted
cd $PBS_O_WORKDIR
--------

After the job has been executed successfully, I can't find the stdout
and stderr files under the job's directory. Any hints on this issue?

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From samuel at unimelb.edu.au Thu Dec 1 22:45:05 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 02 Dec 2011 16:45:05 +1100
Subject: [torqueusers] Cannot connect to default server host 'node32' - check pbs_server daemon.
In-Reply-To: <4ED84BAE.5090803@yahoo.com.cn>
References: <4ED722D6.10404@yahoo.com.cn> <4ED81C81.1010207@unimelb.edu.au> <4ED84BAE.5090803@yahoo.com.cn>
Message-ID: <4ED865E1.2000000@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/12/11 14:53, Hongsheng Zhao wrote:

> Thanks for your hints. But what's the command used to determine the node
> on which the pbs_server is running?

It'll be wherever you installed it.. you pick one node to be your
pbs_server (generally it's on the management node for the cluster).

> -----
> zhaohongsheng at node32:~> cat /opt/gridview/pbs/dispatcher/server_name
> node32
> -------

OK - that says it should be running on that node. Is that what you
intended ?

> Finally, I've solved the issue by restarting the pbs_server daemon as
> follows:

Great, glad to hear that's working!

> node32:~ # /sbin/service pbs_server restart
> node32:~ # service pbs_server restart
[...]
> But I can't figure out the difference between the above two commands.

There is no difference, it's the same command. Just that you've given
the full path to the service command in the first one.

Sounds like you might be new to Linux systems administration, might be
worth finding some good tutorials on it too!

Best of luck,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7YZeEACgkQO2KABBYQAh+jbACfeqIFTxsElsOc2G0y0cWlI6/T
H9YAoIA/FEJgXo2GWqx7u5BJReFAwZ5J
=gQT6
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Thu Dec 1 22:46:00 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 02 Dec 2011 16:46:00 +1100
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <4ED85F47.7090005@yahoo.com.cn>
References: <4ED85F47.7090005@yahoo.com.cn>
Message-ID: <4ED86618.7090105@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/12/11 16:16, Hongsheng Zhao wrote:

> But this will have an influence on all of the running jobs on this hpc.

It shouldn't do, they should be able to carry on running quite happily
if you restart the pbs_server.

Best of luck,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7YZhgACgkQO2KABBYQAh8ZIQCePDj22LtASDBasHwmBZKG/flV
slEAn13bSbeiy3PQcXyTisOKSa/B9GT6
=9aFS
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Thu Dec 1 22:48:53 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 02 Dec 2011 16:48:53 +1100
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4ED86182.2000202@yahoo.com.cn>
References: <4ED86182.2000202@yahoo.com.cn>
Message-ID: <4ED866C5.7020803@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/12/11 16:26, Hongsheng Zhao wrote:

> After the job has been executed successfully, I can't find the stdout
> and stderr files under the job's directory. Any hints on this issue?

If the users' home directories are all under /home, and if /home is
mounted across the cluster (say via NFS, Lustre, GPFS or similar) then
you can put:

$usecp *:/home /home

in your pbs_mom.conf file to tell pbs_mom to not use scp to copy files
back but to just cp them instead.

If you do need to use scp because there is no shared /home directory
then that's a bit more complicated, but only worth investigating in that
(unlikely) scenario.
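
As a rough sketch of the $usecp change described above: on a stock TORQUE
install the pbs_mom settings normally live in mom_priv/config under the
TORQUE home directory, rather than in a file literally named pbs_mom.conf.
The default path below and the helper name has_usecp are assumptions, so
point MOM_CONFIG at wherever your installation keeps that file. The script
only checks for the line and never edits anything.

```shell
#!/bin/sh
# Assumed default location; vendor builds may keep this file elsewhere.
MOM_CONFIG="${MOM_CONFIG:-/var/spool/torque/mom_priv/config}"

# The directive from the message above: copy with cp instead of scp under /home.
USECP_LINE='$usecp *:/home /home'

# Hypothetical helper: succeeds if the exact directive is already in the file.
has_usecp() {
    [ -f "$1" ] && grep -Fxq "$USECP_LINE" "$1"
}

if has_usecp "$MOM_CONFIG"; then
    echo "already configured: $MOM_CONFIG"
else
    echo "add this line to $MOM_CONFIG on each compute node:"
    echo "$USECP_LINE"
fi
```

pbs_mom reads this file when it starts, so the daemon on each compute node
needs to be restarted after the line is added.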
All the best,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7YZsUACgkQO2KABBYQAh8dsACfUcfJVXjkbj1it1+2sQq1ihme
8bMAnicVG40iymyjIuQSOuyDinb+uWu8
=75lp
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Thu Dec 1 23:39:31 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 02 Dec 2011 17:39:31 +1100
Subject: [torqueusers] reporting of max cpus found available (not offline)
In-Reply-To: <4EC8C40B.2050002@cern.ch>
References: <4EC8C40B.2050002@cern.ch>
Message-ID: <4ED872A3.4010406@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 20/11/11 20:10, Adrian Sevcenco wrote:

> Hi! I see that qstat at the end gives very useful information about the
> available (online used and unused) cores.

Which version ? Or do you mean showq ?

> there is no reporting command that would report this (only what is configured)
> Is there a way or can someone point me to relevant portion of API that
> would give me exactly this?

My best guess at the moment would be to parse the XML output of
pbsnodes -x; that will give you what's configured (as NP) and
you can work out how many are in use by counting the number of
vnodes assigned to jobs on each node.

As for the API, I guess pbs_statnode(3) will do the same, it looks
like you can even tell it to only return the "np" and "jobs"
attributes for all nodes.

Disclaimer: I am not a programmer.
:-)

cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7YcqMACgkQO2KABBYQAh8xvgCfXAKXnb9HHV2ZQ7j2ks10qaGG
kVIAnj9Gm3HwWJv3zdi8gikXouTDJgpe
=9YIg
-----END PGP SIGNATURE-----

From zhaohscas at yahoo.com.cn Fri Dec 2 00:27:58 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Fri, 02 Dec 2011 15:27:58 +0800
Subject: [torqueusers] Cannot connect to default server host 'node32' - check pbs_server daemon.
In-Reply-To: <4ED865E1.2000000@unimelb.edu.au>
References: <4ED722D6.10404@yahoo.com.cn> <4ED81C81.1010207@unimelb.edu.au> <4ED84BAE.5090803@yahoo.com.cn> <4ED865E1.2000000@unimelb.edu.au>
Message-ID: <4ED87DFE.2040206@yahoo.com.cn>

On 12/02/2011 01:45 PM, Christopher Samuel wrote:
>> Thanks for your hints. But what's the command used to determine the node
>> on which the pbs_server is running?
> It'll be wherever you installed it.. you pick one node to be your
> pbs_server (generally it's on the management node for the cluster).

Yes, we use node32 as the management node for the cluster. Thanks a
lot for your explanations ;-)

>
>> -----
>> zhaohongsheng at node32:~> cat /opt/gridview/pbs/dispatcher/server_name
>> node32
>> -------
> OK - that says it should be running on that node. Is that what you
> intended ?

Aha, I'm a user of this hpc, and the hpc was pre-deployed by the
vendor ;-)

>
>> Finally, I've solved the issue by restarting the pbs_server daemon as
>> follows:
> Great, glad to hear that's working!
>
>> node32:~ # /sbin/service pbs_server restart
>> node32:~ # service pbs_server restart
> [...]
>> But I can't figure out the difference between the above two commands.
> There is no difference, it's the same command.
> Just that you've given
> the full path to the service command in the first one. Sounds like you
> might be new to Linux systems administration, might be worth finding
> some good tutorials on it too!

Thanks again for your suggestions.

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Fri Dec 2 00:32:09 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Fri, 02 Dec 2011 15:32:09 +0800
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <4ED86618.7090105@unimelb.edu.au>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au>
Message-ID: <4ED87EF9.7070306@yahoo.com.cn>

On 12/02/2011 01:46 PM, Christopher Samuel wrote:
>> But this will have an influence on all of the running jobs on this hpc.
> It shouldn't do, they should be able to carry on running quite happily
> if you restart the pbs_server.

Do you mean it won't have any influence on the running jobs at all?
I'm not a native English speaker, so sorry for asking again.

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Fri Dec 2 00:41:18 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Fri, 02 Dec 2011 15:41:18 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4ED866C5.7020803@unimelb.edu.au>
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au>
Message-ID: <4ED8811E.6060706@yahoo.com.cn>

On 12/02/2011 01:48 PM, Christopher Samuel wrote:
>> After the job has been executed successfully, I can't find the stdout
>> and stderr files under the job's directory. Any hints on this issue?
> If the users' home directories are all under /home, and if /home is
> mounted across the cluster (say via NFS, Lustre, GPFS or similar) then
> you can put:
>
> $usecp *:/home /home
>
> in your pbs_mom.conf file to tell pbs_mom to not use scp to copy files
> back but to just cp them instead.

I've tried to find the pbs_mom.conf on the management node of my
cluster but failed. Any hints?

>
> If you do need to use scp because there is no shared /home directory
> then that's a bit more complicated, but only worth investigating in that
> (unlikely) scenario.

Thanks a lot for your quick and helpful reply ;-)

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Fri Dec 2 01:23:36 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Fri, 02 Dec 2011 16:23:36 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4ED8811E.6060706@yahoo.com.cn>
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn>
Message-ID: <4ED88B08.1080807@yahoo.com.cn>

On 12/02/2011 03:41 PM, Hongsheng Zhao wrote:
> On 12/02/2011 01:48 PM, Christopher Samuel wrote:
>>> After the job has been executed successfully, I can't find the stdout
>>> and stderr files under the job's directory. Any hints on this issue?
>> If the users' home directories are all under /home, and if /home is
>> mounted across the cluster (say via NFS, Lustre, GPFS or similar) then
>> you can put:
>>
>> $usecp *:/home /home
>>
>> in your pbs_mom.conf file to tell pbs_mom to not use scp to copy files
>> back but to just cp them instead.
> I've tried to find the pbs_mom.conf on the management node of my cluster
> but failed. Any hints?
>

After some tests, I've found the following clues but failed to solve the issue.
Let me describe it as follows:

1- After the job has been executed, I always receive a message like
the following in my remote terminal:

---------
You have new mail in /var/mail/zhaohongsheng
---------

So, I looked at the content of this file:

------------
zhaohongsheng at node32:~/work/Dr.Zhao/castep_test> tail -20 /var/mail/zhaohongsheng
PBS Job Id: 744.node32.nxu.edu.cn
Job Name: ZnO
Exec host: node29/7+node29/6+node29/5+node29/4+node29/3+node29/2+node29/1+node29/0+node33/15+node33/14+node33/13+node33/12+node33/11+node33/10+node33/9+node33/8+node33/7+node33/6+node33/5+node33/4+node33/3+node33/2+node33/1+node33/0
An error has occurred processing your job, see below.
Post job file processing error; job 744.node32.nxu.edu.cn on host node29/7+node29/6+node29/5+node29/4+node29/3+node29/2+node29/1+node29/0+node33/15+node33/14+node33/13+node33/12+node33/11+node33/10+node33/9+node33/8+node33/7+node33/6+node33/5+node33/4+node33/3+node33/2+node33/1+node33/0

Unable to copy file /opt/gridview/pbs/dispatcher/spool/744.node32.nxu.edu.cn.OU to zhaohongsheng at node32:/public/home/zhaohongsheng/work/Dr.Zhao/castep_test/stdout
>>> error from copy
ssh: connect to host node32 port 22: Network is unreachable
lost connection
>>> end error output
Output retained on that host in: /opt/gridview/pbs/dispatcher/undelivered/744.node32.nxu.edu.cn.OU

Unable to copy file /opt/gridview/pbs/dispatcher/spool/744.node32.nxu.edu.cn.ER to zhaohongsheng at node32:/public/home/zhaohongsheng/work/Dr.Zhao/castep_test/stderr
>>> error from copy
ssh: connect to host node32 port 22: Network is unreachable
lost connection
>>> end error output
Output retained on that host in: /opt/gridview/pbs/dispatcher/undelivered/744.node32.nxu.edu.cn.ER
-------------

Based on the above log, it looks like I don't have passwordless ssh access to node32.
But I've set up ssh access to node32 without a password; see the
following for details:

----------
zhaohongsheng at node32:~/work/Dr.Zhao/castep_test> ssh node32
Last login: Sat Dec 3 01:11:22 2011 from 202.201.128.36
zhaohongsheng at node32:~>
-------------

Any hints on this strange issue will be highly appreciated. Thanks in
advance.

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From shaomin.hu at gmail.com Thu Dec 1 12:43:34 2011
From: shaomin.hu at gmail.com (Shaomin Hu)
Date: Thu, 1 Dec 2011 14:43:34 -0500
Subject: [torqueusers] pbsnodes still show node state=free with all np assigned
Message-ID:

We are running Torque v3.0.2 with the Maui scheduler. We are running a
5-node job on nodes carter-a631, a630, a629, a628 and a615. All 16
cores on each of these nodes are assigned to this job. The state of
nodes a631, a629, a628 and a615 shows job-exclusive, but node
carter-a630 still shows state=free.

[root at carter-adm accounting]# qstat -a -n1

carter-adm.rcac.purdue.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1051.carter-adm.     mluisier workq    P6400                --   400 640     -- 02:00 Q    --  --
1622.carter-adm.
knagara workq STDIN 6957 5 80 -- 04:00 R 02:57 carter-a631/15+carter-a631/14+carter-a631/13+carter-a631/12+carter-a631/11+carter-a631/10+carter-a631/9+carter-a631/8+carter-a631/7+carter-a631/6+carter-a631/5+carter-a631/4+carter-a631/3+carter-a631/2+carter-a631/1+carter-a631/0+carter-a630/15+carter-a630/14+carter-a630/13+carter-a630/12+carter-a630/11+carter-a630/10+carter-a630/9+carter-a630/8+carter-a630/7+carter-a630/6+carter-a630/5+carter-a630/4+carter-a630/3+carter-a630/2+carter-a630/1+carter-a630/0+carter-a629/15+carter-a629/14+carter-a629/13+carter-a629/12+carter-a629/11+carter-a629/10+carter-a629/9+carter-a629/8+carter-a629/7+carter-a629/6+carter-a629/5+carter-a629/4+carter-a629/3+carter-a629/2+carter-a629/1+carter-a629/0+carter-a628/15+carter-a628/14+carter-a628/13+carter-a628/12+carter-a628/11+carter-a628/10+carter-a628/9+carter-a628/8+carter-a628/7+carter-a628/6+carter-a628/5+carter-a628/4+carter-a628/3+carter-a628/2+carter-a628/1+carter-a628/0+carter-a615/15+carter-a615/14+carter-a615/13+carter-a615/12+carter-a615/11+carter-a615/10+carter-a615/9+carter-a615/8+carter-a615/7+carter-a615/6+carter-a615/5+carter-a615/4+carter-a615/3+carter-a615/2+carter-a615/1+carter-a615/0 1625.carter-adm. 
hu8 workq submit.pbs -- 400 640 -- 04:00 Q -- -- [root at carter-adm accounting]# qstat -f 1622 Job Id: 1622.carter-adm.rcac.purdue.edu Job_Name = STDIN Job_Owner = knagara at carter-fe00.rcac.purdue.edu resources_used.cput = 19:37:42 resources_used.mem = 90248520kb resources_used.vmem = 116310764kb resources_used.walltime = 02:57:56 job_state = R queue = workq server = carter-adm.rcac.purdue.edu Checkpoint = u ctime = Thu Dec 1 11:22:01 2011 Error_Path = /dev/pts/0 exec_host = carter-a631/15+carter-a631/14+carter-a631/13+carter-a631/12+ca rter-a631/11+carter-a631/10+carter-a631/9+carter-a631/8+carter-a631/7+ carter-a631/6+carter-a631/5+carter-a631/4+carter-a631/3+carter-a631/2+ carter-a631/1+carter-a631/0+carter-a630/15+carter-a630/14+carter-a630/ 13+carter-a630/12+carter-a630/11+carter-a630/10+carter-a630/9+carter-a 630/8+carter-a630/7+carter-a630/6+carter-a630/5+carter-a630/4+carter-a 630/3+carter-a630/2+carter-a630/1+carter-a630/0+carter-a629/15+carter- a629/14+carter-a629/13+carter-a629/12+carter-a629/11+carter-a629/10+ca rter-a629/9+carter-a629/8+carter-a629/7+carter-a629/6+carter-a629/5+ca rter-a629/4+carter-a629/3+carter-a629/2+carter-a629/1+carter-a629/0+ca rter-a628/15+carter-a628/14+carter-a628/13+carter-a628/12+carter-a628/ 11+carter-a628/10+carter-a628/9+carter-a628/8+carter-a628/7+carter-a62 8/6+carter-a628/5+carter-a628/4+carter-a628/3+carter-a628/2+carter-a62 8/1+carter-a628/0+carter-a615/15+carter-a615/14+carter-a615/13+carter- a615/12+carter-a615/11+carter-a615/10+carter-a615/9+carter-a615/8+cart er-a615/7+carter-a615/6+carter-a615/5+carter-a615/4+carter-a615/3+cart er-a615/2+carter-a615/1+carter-a615/0 exec_port = 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15 003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+ 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+1500 3+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15 
003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+ 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+1500 3+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003 Hold_Types = n interactive = True Join_Path = n Keep_Files = n Mail_Points = a mtime = Thu Dec 1 11:23:10 2011 Output_Path = /dev/pts/0 Priority = 0 qtime = Thu Dec 1 11:22:01 2011 Rerunable = False Resource_List.neednodes = 5:ppn=16 Resource_List.nodect = 5 Resource_List.nodes = 5:ppn=16 Resource_List.walltime = 04:00:00 session_id = 6957 substate = 42 Variable_List = PBS_O_QUEUE=workq,PBS_O_HOME=/home/ba01/u111/knagara, PBS_O_LANG=C,PBS_O_LOGNAME=knagara, PBS_O_PATH=/usr/lib64/qt-3.3/bin:/opt/platform_mpi/bin:/usr/local/bin :/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/clustertest/bin:/o pt/cuda/bin:/opt/cuda/C/bin/linux/release:/opt/hpss/bin:/opt/hsi/bin:/ opt/bin:/usr/pbs/bin,PBS_O_MAIL=/var/spool/mail/knagara, PBS_O_SHELL=/usr/local/bin/bash, PBS_O_HOST=carter-fe00.rcac.purdue.edu, PBS_SERVER=carter-adm.rcac.purdue.edu, PBS_O_WORKDIR=/home/ba01/u111/knagara euser = knagara egroup = itap hashname = 1622.carter-adm.rcac.purdue.edu queue_rank = 109 queue_type = E etime = Thu Dec 1 11:22:01 2011 submit_args = -l nodes=5:ppn=16 -I start_time = Thu Dec 1 11:22:14 2011 Walltime.Remaining = 3706 start_count = 1 fault_tolerant = False submit_host = carter-fe00.rcac.purdue.edu init_work_dir = /home/ba01/u111/knagara [root at carter-adm accounting]# pbsnodes carter-a615 carter-a615 state = job-exclusive np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 
1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767209,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2960692084,gres=,loadave=0.00,ncpus=16,physmem=32841344kb,availmem=48544800kb,totmem=49618552kb,idletime=97455,nusers=1,nsessions=1,sessions=10669,uname=Linux carter-a615.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at carter-adm accounting]# pbsnodes carter-a628 carter-a628 state = job-exclusive np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767252,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2959716791,gres=,loadave=0.31,ncpus=16,physmem=32841344kb,availmem=48523232kb,totmem=49618552kb,idletime=97387,nusers=0,nsessions=0,uname=Linux carter-a628.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at carter-adm accounting]# pbsnodes carter-a629 carter-a629 state = job-exclusive np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 
1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767259,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2958375729,gres=,loadave=0.00,ncpus=16,physmem=32841344kb,availmem=48550744kb,totmem=49618552kb,idletime=97396,nusers=0,nsessions=0,uname=Linux carter-a629.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at carter-adm accounting]# pbsnodes carter-a630 carter-a630 state = free np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767263,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2959950109,gres=,loadave=0.01,ncpus=16,physmem=32841344kb,availmem=48526672kb,totmem=49618552kb,idletime=97399,nusers=0,nsessions=0,uname=Linux carter-a630.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at 
carter-adm accounting]# pbsnodes carter-a631 carter-a631 state = job-exclusive np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767255,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2883619488,gres=,loadave=0.00,ncpus=16,physmem=32841344kb,availmem=48547336kb,totmem=49618552kb,idletime=97387,nusers=1,nsessions=1,sessions=6957,uname=Linux carter-a631.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at carter-adm accounting]# The node definition are as follows, carter-a614 np=16 carter carter-a615 np=16 carter carter-a616 np=16 carter carter-a617 np=16 carter carter-a618 np=16 carter carter-a619 np=16 carter carter-a620 np=16 carter carter-a621 np=16 carter carter-a622 np=16 carter carter-a623 np=16 carter carter-a624 np=16 carter carter-a625 np=16 carter carter-a626 np=16 carter carter-a627 np=16 carter carter-a628 np=16 carter carter-a629 np=16 carter carter-a630 np=16 carter carter-a631 np=16 carter carter-a632 np=16 carter Any users have the similar issue? Thanks, Shaomin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111201/b1d2cb1f/attachment-0001.html From gus at ldeo.columbia.edu Fri Dec 2 09:40:13 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Fri, 2 Dec 2011 11:40:13 -0500 Subject: [torqueusers] Cann't generate the PBS log files. In-Reply-To: <4ED86182.2000202@yahoo.com.cn> References: <4ED86182.2000202@yahoo.com.cn> Message-ID: <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu> Stdout and stderr by default stay in the first node on your list, until the job ends. At that point they're transferred to the work directory. Follow Christopher Samuel's recommendation for the pbs_mom.conf files [on all nodes!] I hope this helps, Gus Correa On Dec 2, 2011, at 12:26 AM, Hongsheng Zhao wrote: > Hi all, > > In my pbs script, I use the following codes to generate the PBS log files: > > -------- > # Set filenames for PBS to log standard error and standard output. > #PBS -o stdout > #PBS -e stderr > ... > #Start job from the directory it was submitted > cd $PBS_O_WORKDIR > -------- > > After the job has been executed successfully, I cann't find the stdout > and stderr files under the job's directory. Any hints on this issue? > > Best regards > -- > Hongsheng Zhao > School of Physics and Electrical Information Science, > Ningxia University, Yinchuan 750021, China > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Fri Dec 2 09:57:57 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Fri, 2 Dec 2011 11:57:57 -0500 Subject: [torqueusers] My issue when changing the nodes list for a queue. 
In-Reply-To: <4ED87EF9.7070306@yahoo.com.cn>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn>
Message-ID: <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu>

As Chris said, the jobs should continue running when you restart the pbs_server.

Note that pbs_server runs only on your management node [node32, right?].
Besides, you should also have a scheduler running on the management node.
The scheduler could be the very simple pbs_sched [first-in, first-out policy],
or something more sophisticated like maui [open source] or moab [proprietary].

On the other hand, pbs_mom runs on *all* compute nodes.
If your management node is also used for computations, then it should also run pbs_mom.
If it is not used for computations you don't need to run pbs_mom there.
The compute nodes do *not* run the pbs_server or the scheduler.

Your $TORQUE/server_name file should contain the name of your management node.
Your $TORQUE/server_priv/nodes file should contain the list of nodes,
the number of CPUs on each, and perhaps their properties
[if they're different from each other]. Something like:

node01 np=8
...

[Here $TORQUE is the directory where you installed Torque/PBS.]

Some emails ago you seem to have said the nodes file was under acl_XXX.
I am not sure, but check it to make sure the nodes file is in the right location.

Torque and Maui have very good documentation, not hard to understand even for
non-native English speakers [like you and me].
It will save you a lot of time and headaches in the long run if you take some time
now to read it.
General URL, follow the links for Torque, Maui, etc:

http://www.adaptivecomputing.com/resources/docs/

I hope this helps,
Gus Correa

On Dec 2, 2011, at 2:32 AM, Hongsheng Zhao wrote:

> On 12/02/2011 01:46 PM, Christopher Samuel wrote:
>>> But, this will have influence on all of the running jobs on this hpc.
>> It shouldn't do, they should be able to carry on running quite happily
>> if you restart the pbs_server.
>
> Do you mean it won't have any influence on all of the running jobs at
> all? I'm not a English native speaker, sorry for my naive re-asking.
>
> Regards
> --
> Hongsheng Zhao
> School of Physics and Electrical Information Science,
> Ningxia University, Yinchuan 750021, China
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From sm4082 at nyu.edu Fri Dec 2 09:58:26 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 2 Dec 2011 11:58:26 -0500
Subject: [torqueusers] qsub exclude nodes with certain features
Message-ID: <3A79D1FF-A7F7-49A9-AF7C-BD3BD330C1F6@nyu.edu>

Hello Everyone,

I am trying to keep some interactive jobs off certain nodes in the cluster. I always use features through qsub -l feature= to make jobs go on to nodes with certain features. I want to know whether it is possible to make jobs not go on to nodes with certain features. I googled and found nothing. We use the moab scheduler.

I thought I would ask here before doing this through a wrapper. If there is no way to exclude nodes with certain features, I will have to assign features to all the nodes I want jobs to go onto (which is a big number), whereas the number of nodes I don't want jobs to go onto is very small (just 4).

Please let me know if anyone has any idea.

Thanks in advance,
Sreedhar.

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111202/f0827f8c/attachment.html
From gus at ldeo.columbia.edu Fri Dec 2 10:01:33 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Fri, 2 Dec 2011 12:01:33 -0500
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4ED8811E.6060706@yahoo.com.cn>
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn>
Message-ID: <163DADEE-53A0-4042-AB06-2B9D70F162F0@ldeo.columbia.edu>

On Dec 2, 2011, at 2:41 AM, Hongsheng Zhao wrote:

> On 12/02/2011 01:48 PM, Christopher Samuel wrote:
>>> After the job has been executed successfully, I cann't find the stdout
>>>> and stderr files under the job's directory. Any hints on this issue?
>> If the users home directories are all under /home, and if /home is
>> mounted across the cluster (say via NFS, Lustre, GPFS or similar) then
>> you can put:
>>
>> $usecp *:/home /home
>>
>> in your pbs_mom.conf file to tell pbs_mom to not use scp to copy files
>> back but to just cp them instead.
>
> I've try to find the pbs_mom.conf on the management node of my cluster
> but failed. Any hints?
>

This configuration file should be present on all compute nodes.
If your management node is also used for computation, then it should also run pbs_mom,
and should also have the same pbs_mom.conf file.

Most of your questions so far are clearly answered in the Torque administrator guide.
Please see my previous email.
You can read it here:

http://www.adaptivecomputing.com/resources/docs/

>>
>> If you do need to use scp because there is no shared /home directory
>> then that's a bit more complicated, but only worth investigating in that
>> (unlikely) scenario.
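
On the $usecp line quoted above: here is a small sketch, assuming /home really is mounted at the same path on every node, of how one might check a mom config file for the directive before adding it. The function names are made up for illustration and the scratch file only stands in for the real config; nothing here is a Torque command.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not Torque commands.

# usecp_line PREFIX: print the $usecp directive for a file system mounted
# at the same PREFIX on the server and on every compute node.
usecp_line() {
    printf '$usecp *:%s %s\n' "$1" "$1"
}

# has_usecp FILE PREFIX: succeed if FILE already has exactly that line.
has_usecp() {
    grep -qxF -- "$(usecp_line "$2")" "$1"
}

# Demo against a scratch file standing in for the real mom config.
cfg=$(mktemp)
has_usecp "$cfg" /home || usecp_line /home >> "$cfg"
cat "$cfg"
rm -f "$cfg"
```

If home directories live somewhere else on your cluster, pass that prefix instead of /home; the same line would have to end up in the mom config on every compute node.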
> > Thanks a lot for quickly and helpful reply ;-) > > Best regards > -- > Hongsheng Zhao > School of Physics and Electrical Information Science, > Ningxia University, Yinchuan 750021, China > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Fri Dec 2 10:23:51 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Fri, 2 Dec 2011 12:23:51 -0500 Subject: [torqueusers] Cann't generate the PBS log files. In-Reply-To: <163DADEE-53A0-4042-AB06-2B9D70F162F0@ldeo.columbia.edu> References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn> <163DADEE-53A0-4042-AB06-2B9D70F162F0@ldeo.columbia.edu> Message-ID: Actually the mom configuration file is named 'config' only, and is located at $TORQUE/mom_priv/config on all compute nodes. I hope this helps. On Dec 2, 2011, at 12:01 PM, Gustavo Correa wrote: > > On Dec 2, 2011, at 2:41 AM, Hongsheng Zhao wrote: > >> On 12/02/2011 01:48 PM, Christopher Samuel wrote: >>>> After the job has been executed successfully, I cann't find the stdout >>>>> and stderr files under the job's directory. Any hints on this issue? >>> If the users home directories are all under /home, and if /home is >>> mounted across the cluster (say via NFS, Lustre, GPFS or similar) then >>> you can put: >>> >>> $usecp *:/home /home >>> >>> in your pbs_mom.conf file to tell pbs_mom to not use scp to copy files >>> back but to just cp them instead. >> >> I've try to find the pbs_mom.conf on the management node of my cluster >> but failed. Any hints? >> > > This configuration file should be present in all compute nodes. > If your management node is also used for computation, they it should also run pbs_mom, > and should have also the same pbs_mom.conf file. > > Most of your questions so far are clearly answered in the Torque administrator guide. 
> Please, see my previous email. > You can read them here: > > http://www.adaptivecomputing.com/resources/docs/ > >>> >>> If you do need to use scp because there is no shared /home directory >>> then that's a bit more complicated, but only worth investigating in that >>> (unlikely) scenario. >> >> Thanks a lot for quickly and helpful reply ;-) >> >> Best regards >> -- >> Hongsheng Zhao >> School of Physics and Electrical Information Science, >> Ningxia University, Yinchuan 750021, China >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > From vanw at sabalcore.com Fri Dec 2 12:46:53 2011 From: vanw at sabalcore.com (Kevin Van Workum) Date: Fri, 2 Dec 2011 14:46:53 -0500 Subject: [torqueusers] Cann't generate the PBS log files. In-Reply-To: <4ED88B08.1080807@yahoo.com.cn> References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn> <4ED88B08.1080807@yahoo.com.cn> Message-ID: On Fri, Dec 2, 2011 at 3:23 AM, Hongsheng Zhao wrote: > On 12/02/2011 03:41 PM, Hongsheng Zhao wrote: > > On 12/02/2011 01:48 PM, Christopher Samuel wrote: > >>> >> After the job has been executed successfully, I cann't find the > stdout > >>>> >> > and stderr files under the job's directory. Any hints on > this issue? > >> > If the users home directories are all under /home, and if /home is > >> > mounted across the cluster (say via NFS, Lustre, GPFS or similar) > then > >> > you can put: > >> > > >> > $usecp *:/home /home > >> > > >> > in your pbs_mom.conf file to tell pbs_mom to not use scp to copy > files > >> > back but to just cp them instead. > > I've try to find the pbs_mom.conf on the management node of my cluster > > but failed. Any hints? > > > > After some test, I've found the following clues but failed to solve the > issue. 
Let me describe it as follows: > > 1- After the job has been executed, I'll always receive a message from > within my remote terminal as the following: > > --------- > You have new mail in /var/mail/zhaohongsheng > --------- > > So, I try to see the content of this file: > > ------------ > zhaohongsheng at node32:~/work/Dr.Zhao/castep_test> tail -20 > /var/mail/zhaohongsheng > PBS Job Id: 744.node32.nxu.edu.cn > Job Name: ZnO > Exec host: > > node29/7+node29/6+node29/5+node29/4+node29/3+node29/2+node29/1+node29/0+node33/15+node33/14+node33/13+node33/12+node33/11+node33/10+node33/9+node33/8+node33/7+node33/6+node33/5+node33/4+node33/3+node33/2+node33/1+node33/0 > An error has occurred processing your job, see below. > Post job file processing error; job 744.node32.nxu.edu.cn on host > > node29/7+node29/6+node29/5+node29/4+node29/3+node29/2+node29/1+node29/0+node33/15+node33/14+node33/13+node33/12+node33/11+node33/10+node33/9+node33/8+node33/7+node33/6+node33/5+node33/4+node33/3+node33/2+node33/1+node33/0 > > Unable to copy file > /opt/gridview/pbs/dispatcher/spool/744.node32.nxu.edu.cn.OU to > zhaohongsheng at node32 > :/public/home/zhaohongsheng/work/Dr.Zhao/castep_test/stdout > >>> error from copy > ssh: connect to host node32 port 22: Network is unreachable > lost connection > >>> end error output > Output retained on that host in: > /opt/gridview/pbs/dispatcher/undelivered/744.node32.nxu.edu.cn.OU > > Unable to copy file > /opt/gridview/pbs/dispatcher/spool/744.node32.nxu.edu.cn.ER to > zhaohongsheng at node32 > :/public/home/zhaohongsheng/work/Dr.Zhao/castep_test/stderr > >>> error from copy > ssh: connect to host node32 port 22: Network is unreachable > lost connection > >>> end error output > Output retained on that host in: > /opt/gridview/pbs/dispatcher/undelivered/744.node32.nxu.edu.cn.ER > ------------- > > Based on the above logfile, it looks like I haven't the anonymous ssh > access privilege to node32. 
But I've set the ssh accessing to the > node32 without using password, see following for detail: > > ---------- > zhaohongsheng at node32:~/work/Dr.Zhao/castep_test> ssh node32 > Last login: Sat Dec 3 01:11:22 2011 from 202.201.128.36 > zhaohongsheng at node32:~> > ------------- > > Any hints on this strange issue will be highly appreciated. > Can you ssh from node29 to node32? > > Thanks in advance. > > Best regards > -- > Hongsheng Zhao > School of Physics and Electrical Information Science, > Ningxia University, Yinchuan 750021, China > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Kevin Van Workum, PhD Sabalcore Computing Inc. Run your code on 500 processors. Sign up for a free trial account. www.sabalcore.com 877-492-8027 ext. 11 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111202/a3bd54c9/attachment-0001.html From roman.ricardo at gmail.com Fri Dec 2 13:39:16 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Fri, 2 Dec 2011 14:39:16 -0600 Subject: [torqueusers] specific nodes In-Reply-To: References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> Message-ID: My thanks to Glen, Lloyd and Gus for all your help on this topic =) This is my current situation: i have this script and a non-torque MPIEXEC 1.2.1p1: > #PBS -q uno > #PBS -l nodes=2 > echo "Nodes Assigned:" > cat $PBS_NODEFILE > echo "running... -l nodes=2 && -n 1" > /usr/local/bin/mpiexec -n 1 $HOME/a.out > > echo "running... -l nodes=2 && -n 2" > /usr/local/bin/mpiexec -n 2 $HOME/a.out > > echo "running... -l nodes=2 && -n 4" > /usr/local/bin/mpiexec -n 4 $HOME/a.out > > echo "running... 
-l nodes=2 && -n 8"
> /usr/local/bin/mpiexec -n 8 $HOME/a.out
>
>
> echo "done"

and with that I get this output:

> [rroman at zarate-0:~/outputs]$ cat a.torque.uno.o75
> Nodes Assigned:
> zarate-1
> zarate-1
>
> running... -l nodes=2 && -n 1
> zarate-1: hello world from process 0 of 1
>
> running... -l nodes=2 && -n 2
> zarate-1: hello world from process 0 of 2
> zarate-1: hello world from process 1 of 2
>
> running... -l nodes=2 && -n 4
> rank 3 in job 6 zarate-1_50632 caused collective abort of all ranks
> exit status of rank 3: return code 1
>
> running... -l nodes=2 && -n 8
> zarate-0: hello world from process 2 of 8
> zarate-2: hello world from process 3 of 8
> zarate-2: hello world from process 4 of 8
> zarate-3: hello world from process 6 of 8
> zarate-3: hello world from process 5 of 8
> zarate-1: hello world from process 0 of 8
> zarate-1: hello world from process 1 of 8
> zarate-1: hello world from process 7 of 8
> done

and this error log:

> [rroman at zarate-0:~/outputs]$ cat a.torque.uno.e75
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(394).................: Initialization failed
> MPID_Init(135)........................: channel initialization failed
> MPIDI_CH3_Init(43)....................:
> MPID_nem_init(202)....................:
> MPIDI_CH3I_Seg_commit(366)............:
> MPIU_SHMW_Hnd_deserialize(358)........:
> MPIU_SHMW_Seg_open(897)...............:
> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory

I'm starting my MPD daemons with mpdboot -n 4 and using a machines file that has my 4 nodes (zarate-0,1,2,3).

So here I have some unanswered questions...
1) torque gives me 2 instances of zarate-1 because in my nodes file it has np=2, right?
2) why does (or should) it crash with 4 mpi processes?
3) and how come that if PBS_NODEFILE has just zarate-1 2 times, when I run it with 8 mpi processes it runs on all nodes??

thanks again for helping me guys, but this is all i need to make my cluster run officially! =) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111202/2c8db593/attachment.html From glen.beane at gmail.com Fri Dec 2 13:48:16 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Fri, 2 Dec 2011 15:48:16 -0500 Subject: [torqueusers] specific nodes In-Reply-To: References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> Message-ID: Sent from my iPhone On Dec 2, 2011, at 3:39 PM, Ricardo Rom?n Brenes wrote: > My thanks to Glen, Lloyd and Gus for all your help on this topic =) > > This is my current situation: > > i have this script and a non-torque MPIEXEC 1.2.1p1: > #PBS -q uno > #PBS -l nodes=2 > echo "Nodes Assigned:" > cat $PBS_NODEFILE > echo "running... -l nodes=2 && -n 1" > /usr/local/bin/mpiexec -n 1 $HOME/a.out > > echo "running... -l nodes=2 && -n 2" > /usr/local/bin/mpiexec -n 2 $HOME/a.out > > echo "running... -l nodes=2 && -n 4" > /usr/local/bin/mpiexec -n 4 $HOME/a.out > > echo "running... -l nodes=2 && -n 8" > /usr/local/bin/mpiexec -n 8 $HOME/a.out > > > echo "done" > > and with that i get this output: > [rroman at zarate-0:~/outputs]$ cat a.torque.uno.o75 > Nodes Assigned: > zarate-1 > zarate-1 > > running... -l nodes=2 && -n 1 > zarate-1: hello world from process 0 of 1 > > running... -l nodes=2 && -n 2 > zarate-1: hello world from process 0 of 2 > zarate-1: hello world from process 1 of 2 > > running... -l nodes=2 && -n 4 > rank 3 in job 6 zarate-1_50632 caused collective abort of all ranks > exit status of rank 3: return code 1 > > running... 
-l nodes=2 && -n 8 > zarate-0: hello world from process 2 of 8 > zarate-2: hello world from process 3 of 8 > zarate-2: hello world from process 4 of 8 > zarate-3: hello world from process 6 of 8 > zarate-3: hello world from process 5 of 8 > zarate-1: hello world from process 0 of 8 > zarate-1: hello world from process 1 of 8 > zarate-1: hello world from process 7 of 8 > done > > and this error log: > [rroman at zarate-0:~/outputs]$ cat a.torque.uno.e75 > Fatal error in MPI_Init: Other MPI error, error stack: > MPIR_Init_thread(394).................: Initialization failed > MPID_Init(135)........................: channel initialization failed > MPIDI_CH3_Init(43)....................: > MPID_nem_init(202)....................: > MPIDI_CH3I_Seg_commit(366)............: > MPIU_SHMW_Hnd_deserialize(358)........: > MPIU_SHMW_Seg_open(897)...............: > MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory > > I'm starting my MPD daemons with mpdboot -n 4 and using a machines file that has my 4 nodes (zarate-0,1,2,3). > > So here i have some answered questions... > 1) torque gives me 2 instances of zarate-1 because in my nodes files it has np=2, right? Yes > 2) why does (or should) it crash with 4 mpi proceses? > 3) and how come that if PBS_NODEFILE has just zarate-1 2 times, when I run it with 8 mpi proceses it runs on all nodes?? > > Your mpiexec is ignoring the nodes assigned by Torque. It is just running on all of the nodes MPD is running on because you have not told it otherwise. Your life will be a lot easier if you download OSC's mpiexec and forget MPD. Or use OpenMPI, which has very good support for torque. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111202/3644102b/attachment.html From glen.beane at gmail.com Fri Dec 2 13:51:22 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Fri, 2 Dec 2011 15:51:22 -0500 Subject: [torqueusers] specific nodes In-Reply-To: References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> Message-ID: Sent from my iPhone On Dec 2, 2011, at 3:48 PM, glen.beane at gmail.com wrote: > > > Sent from my iPhone > > On Dec 2, 2011, at 3:39 PM, Ricardo Rom?n Brenes wrote: > >> My thanks to Glen, Lloyd and Gus for all your help on this topic =) >> >> This is my current situation: >> >> i have this script and a non-torque MPIEXEC 1.2.1p1: >> #PBS -q uno >> #PBS -l nodes=2 >> echo "Nodes Assigned:" >> cat $PBS_NODEFILE >> echo "running... -l nodes=2 && -n 1" >> /usr/local/bin/mpiexec -n 1 $HOME/a.out >> >> echo "running... -l nodes=2 && -n 2" >> /usr/local/bin/mpiexec -n 2 $HOME/a.out >> >> echo "running... -l nodes=2 && -n 4" >> /usr/local/bin/mpiexec -n 4 $HOME/a.out >> >> echo "running... -l nodes=2 && -n 8" >> /usr/local/bin/mpiexec -n 8 $HOME/a.out >> >> >> echo "done" >> >> and with that i get this output: >> [rroman at zarate-0:~/outputs]$ cat a.torque.uno.o75 >> Nodes Assigned: >> zarate-1 >> zarate-1 >> >> running... -l nodes=2 && -n 1 >> zarate-1: hello world from process 0 of 1 >> >> running... -l nodes=2 && -n 2 >> zarate-1: hello world from process 0 of 2 >> zarate-1: hello world from process 1 of 2 >> >> running... -l nodes=2 && -n 4 >> rank 3 in job 6 zarate-1_50632 caused collective abort of all ranks >> exit status of rank 3: return code 1 >> >> running... 
-l nodes=2 && -n 8 >> zarate-0: hello world from process 2 of 8 >> zarate-2: hello world from process 3 of 8 >> zarate-2: hello world from process 4 of 8 >> zarate-3: hello world from process 6 of 8 >> zarate-3: hello world from process 5 of 8 >> zarate-1: hello world from process 0 of 8 >> zarate-1: hello world from process 1 of 8 >> zarate-1: hello world from process 7 of 8 >> done >> >> and this error log: >> [rroman at zarate-0:~/outputs]$ cat a.torque.uno.e75 >> Fatal error in MPI_Init: Other MPI error, error stack: >> MPIR_Init_thread(394).................: Initialization failed >> MPID_Init(135)........................: channel initialization failed >> MPIDI_CH3_Init(43)....................: >> MPID_nem_init(202)....................: >> MPIDI_CH3I_Seg_commit(366)............: >> MPIU_SHMW_Hnd_deserialize(358)........: >> MPIU_SHMW_Seg_open(897)...............: >> MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory >> >> I'm starting my MPD daemons with mpdboot -n 4 and using a machines file that has my 4 nodes (zarate-0,1,2,3). >> >> So here i have some answered questions... >> 1) torque gives me 2 instances of zarate-1 because in my nodes files it has np=2, right? > > Yes I forgot to mention you can change this behavior by modifying your Maui config. You can configure it so it will give you the exact number of nodes you request (EXACTNODE) rather than consolidate processes (EXACTPROC) >> 2) why does (or should) it crash with 4 mpi proceses? >> 3) and how come that if PBS_NODEFILE has just zarate-1 2 times, when I run it with 8 mpi proceses it runs on all nodes?? >> >> > > Your mpiexec is ignoring the nodes assigned by Torque. It is just running on all of the nodes MPD is running on because you have not told it otherwise. Your life will be a lot easier if you download OSC's mpiexec and forget MPD. Or use OpenMPI, which has very good support for torque. 
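
If you do stay with the MPD-based mpiexec for now, a rough sketch of handing it Torque's node list might look like the following. The helper names are made up for illustration, the fallback node file exists only so the snippet runs outside a job, and the -machinefile spelling is an assumption to check against man mpiexec for your MPICH2 build.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Torque or MPICH2.

# count_slots FILE: Torque writes one line per granted slot, so count lines.
count_slots() {
    grep -c . "$1"
}

# unique_hosts FILE: each host once, e.g. for an mpd ring on the job's nodes.
unique_hosts() {
    sort -u "$1"
}

# Outside a Torque job PBS_NODEFILE is unset; fake the two-slot case from
# the output above (zarate-1 listed twice) so the sketch still runs.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf 'zarate-1\nzarate-1\n' > "$PBS_NODEFILE"
fi

NP=$(count_slots "$PBS_NODEFILE")
echo "slots granted: $NP"
echo "hosts: $(unique_hosts "$PBS_NODEFILE" | tr '\n' ' ')"

# Printed rather than executed; -machinefile is an assumed option name.
echo "would run: mpiexec -machinefile $PBS_NODEFILE -n $NP \$HOME/a.out"
```

Taking the process count from the node file keeps the mpiexec line in step with whatever -l nodes=... asked for, so the job cannot quietly start more ranks than Torque granted.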
>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111202/dd6ccc4f/attachment-0001.html
From gus at ldeo.columbia.edu Fri Dec 2 15:07:10 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Fri, 2 Dec 2011 17:07:10 -0500
Subject: [torqueusers] specific nodes
In-Reply-To:
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu>
Message-ID: <1BAB6393-963A-4A58-8223-318FAFC5AF99@ldeo.columbia.edu>

Hi Ricardo

I said, Glen said, Lloyd said, everybody said that you are not using the nodes
that Torque assigns to the job.
Your latest experiments just confirm this.
Your experiments strongly suggest that your mpiexec command is reading a
default machine file where you may have listed *all* nodes in your cluster.
Therefore, it starts with the first node on the list [zarate-1]
but then moves on to the other nodes, completely ignoring what is in $PBS_NODEFILE.
This is why when you launch mpiexec with -n 8 [your fourth mpiexec command]
you get all the nodes, not only zarate-1.

Your MPICH2 is not integrated with Torque, so it doesn't know anything about the
list of nodes in $PBS_NODEFILE, *UNLESS* you tell it about them.
You need to pass the $PBS_NODEFILE to the mpirun command with the -machinefile option
[or -hostfile option, check the syntax with man mpiexec].
You can also pass this file [or a slightly edited version of it, to avoid repeated
node names] to the mpd daemon, start the mpd ring before the job starts,
and stop it after the job ends.
Alternatively you can launch an mpd ring on all nodes, but use only some nodes in mpiexec.
However, the latter often causes problems, particularly if the program dies because of a bug,
or because the mpd daemon was not yet ready for a new job, or something else.
The third mpiexec command that you launched, which aborted, probably had this problem.

BTW, mpd is deprecated; MPICH2 now uses the Hydra process launcher by default.
My guess is that they changed because mpd often left leftover processes around,
which may conflict with subsequent jobs.
I cannot say anything about the Hydra integration with Torque because I never tried it.

An alternative, which I like better because it can integrate with Torque,
has no messy mpd daemon, and is easy to build and install, is the OSC mpiexec:
http://www.osc.edu/~djohnson/mpiexec/index.php
[It is only mpiexec, but does a better job than the official MPICH2 launcher, IMHO.]

Yet another alternative is OpenMPI, which you can easily build *with* torque support:
http://www.open-mpi.org/
OpenMPI is probably the most flexible and complete MPI open source distribution
that you can find.

Whatever you use, if the MPI is not integrated with Torque,
you must pass $PBS_NODEFILE to mpiexec/mpirun.

I hope this helps.
Gus Correa

On Dec 2, 2011, at 3:39 PM, Ricardo Román Brenes wrote:

> My thanks to Glen, Lloyd and Gus for all your help on this topic =)
>
> This is my current situation:
>
> i have this script and a non-torque MPIEXEC 1.2.1p1:
> #PBS -q uno
> #PBS -l nodes=2
> echo "Nodes Assigned:"
> cat $PBS_NODEFILE
> echo "running... -l nodes=2 && -n 1"
> /usr/local/bin/mpiexec -n 1 $HOME/a.out
>
> echo "running... -l nodes=2 && -n 2"
> /usr/local/bin/mpiexec -n 2 $HOME/a.out
>
> echo "running... -l nodes=2 && -n 4"
> /usr/local/bin/mpiexec -n 4 $HOME/a.out
>
> echo "running... -l nodes=2 && -n 8"
> /usr/local/bin/mpiexec -n 8 $HOME/a.out
>
>
> echo "done"
>
> and with that i get this output:
> [rroman at zarate-0:~/outputs]$ cat a.torque.uno.o75
> Nodes Assigned:
> zarate-1
> zarate-1
>
> running...
-l nodes=2 && -n 1 > zarate-1: hello world from process 0 of 1 > > running... -l nodes=2 && -n 2 > zarate-1: hello world from process 0 of 2 > zarate-1: hello world from process 1 of 2 > > running... -l nodes=2 && -n 4 > rank 3 in job 6 zarate-1_50632 caused collective abort of all ranks > exit status of rank 3: return code 1 > > running... -l nodes=2 && -n 8 > zarate-0: hello world from process 2 of 8 > zarate-2: hello world from process 3 of 8 > zarate-2: hello world from process 4 of 8 > zarate-3: hello world from process 6 of 8 > zarate-3: hello world from process 5 of 8 > zarate-1: hello world from process 0 of 8 > zarate-1: hello world from process 1 of 8 > zarate-1: hello world from process 7 of 8 > done > > and this error log: > [rroman at zarate-0:~/outputs]$ cat a.torque.uno.e75 > Fatal error in MPI_Init: Other MPI error, error stack: > MPIR_Init_thread(394).................: Initialization failed > MPID_Init(135)........................: channel initialization failed > MPIDI_CH3_Init(43)....................: > MPID_nem_init(202)....................: > MPIDI_CH3I_Seg_commit(366)............: > MPIU_SHMW_Hnd_deserialize(358)........: > MPIU_SHMW_Seg_open(897)...............: > MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory > > I'm starting my MPD daemons with mpdboot -n 4 and using a machines file that has my 4 nodes (zarate-0,1,2,3). > > So here i have some answered questions... > 1) torque gives me 2 instances of zarate-1 because in my nodes files it has np=2, right? > 2) why does (or should) it crash with 4 mpi proceses? > 3) and how come that if PBS_NODEFILE has just zarate-1 2 times, when I run it with 8 mpi proceses it runs on all nodes?? > > thanks again for helping me guys, but this is all i need to make my cluster run officially! 
> =)

From roman.ricardo at gmail.com Fri Dec 2 15:09:35 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Fri, 2 Dec 2011 16:09:35 -0600
Subject: [torqueusers] specific nodes
In-Reply-To: <1BAB6393-963A-4A58-8223-318FAFC5AF99@ldeo.columbia.edu>
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> <1BAB6393-963A-4A58-8223-318FAFC5AF99@ldeo.columbia.edu>
Message-ID:

yeah pal! i just read that somewhere else! im going to upgrade my MPICH and then try to compile OSC mpiexec =)

thank you and all of you guys for your help!!! =)

BUT BEWARE! if i get more problems i'll get back here! heheh

thanks! =)

-Ricardo

From gus at ldeo.columbia.edu Fri Dec 2 15:34:26 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Fri, 2 Dec 2011 17:34:26 -0500
Subject: [torqueusers] specific nodes
In-Reply-To:
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> <1BAB6393-963A-4A58-8223-318FAFC5AF99@ldeo.columbia.edu>
Message-ID: <621A4540-D776-4FB3-B1DB-F6715120E2E5@ldeo.columbia.edu>

Have you tried this inside your Torque script?

mpirun -f $PBS_NODEFILE -n 8 ./a.out

Which mpirun do you have?
Try 'which mpirun', then 'ls -l' on the resulting full path.
Depending on your MPICH2 installation, you may be mixing mpd with hydra or something else.

Have you read this? [Hydra, see resource manager integration]

http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager

"Resource Manager Integration

Hydra provides capability to integrate with different resource managers.
You can pick these through the mpiexec option -rmk:

shell$ mpiexec -rmk pbs ./app

By default, mpiexec will coordinate with PBS to use all the allocated nodes to run the app. You can also force it to run the application using a different number of processes or on a different set of nodes using:

shell$ mpiexec -rmk pbs -n 4 -f ~/hosts ./app

This can also be controlled by using the HYDRA_RMK environment variable."

From zhaohscas at yahoo.com.cn Fri Dec 2 21:23:55 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Sat, 03 Dec 2011 12:23:55 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To:
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn> <4ED88B08.1080807@yahoo.com.cn>
Message-ID: <4ED9A45B.3050201@yahoo.com.cn>

On 12/03/2011 03:46 AM, Kevin Van Workum wrote:
> Can you ssh from node29 to node32?

Good hints. I can ping node29 and ssh to node29 from node32:

-------
zhaohongsheng@node32:~> ping node29
PING node29 (192.168.1.29) 56(84) bytes of data.
64 bytes from node29 (192.168.1.29): icmp_seq=1 ttl=64 time=0.070 ms
64 bytes from node29 (192.168.1.29): icmp_seq=2 ttl=64 time=0.074 ms
64 bytes from node29 (192.168.1.29): icmp_seq=3 ttl=64 time=0.069 ms
64 bytes from node29 (192.168.1.29): icmp_seq=4 ttl=64 time=0.070 ms

--- node29 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 0.069/0.070/0.074/0.010 ms
zhaohongsheng@node32:~> ssh node29
Last login: Sat Dec 3 15:24:08 2011 from node32.nxu.edu.cn
zhaohongsheng@node29:~>
--------

But when I ssh to node29 and then try to ping node32 or ssh to node32,
both of these operations fail:

-----------
zhaohongsheng@node32:~> ssh node29
Last login: Sat Dec 3 15:25:11 2011 from node32.nxu.edu.cn
zhaohongsheng@node29:~> ping node32
connect: Network is unreachable
zhaohongsheng@node29:~> ssh node32
ssh: connect to host node32 port 22: Network is unreachable
zhaohongsheng@node29:~>
-----------

Could you please give me some more hints on solving this issue?

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Fri Dec 2 21:40:50 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Sat, 03 Dec 2011 12:40:50 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To:
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn> <163DADEE-53A0-4042-AB06-2B9D70F162F0@ldeo.columbia.edu>
Message-ID: <4ED9A852.8050704@yahoo.com.cn>

On 12/03/2011 01:23 AM, Gustavo Correa wrote:
> Actually the mom configuration file is named 'config' only,
> and is located at $TORQUE/mom_priv/config on all compute nodes.
> I hope this helps.

I can't find the above file, but I can find the folder named mom_priv in my case.
See the following for detail:

---------
node32:/opt/gridview # cd pbs/
node32:/opt/gridview/pbs # ls
auth        dispatcher-sched  jobmanager.properties  pbs_killjob.sh  test.pbs
dispatcher  hc.qmgr           pbs_backup.sh          pbs_reload.sh
node32:/opt/gridview/pbs # cd dispatcher
node32:/opt/gridview/pbs/dispatcher # ls
aux         include  mom_logs         sbin        server_logs  spool
bin         lib      mom_priv         sched_logs  server_name  undelivered
checkpoint  man      pbs_environment  sched_priv  server_priv
node32:/opt/gridview/pbs/dispatcher # cd mom_priv/
node32:/opt/gridview/pbs/dispatcher/mom_priv # ls
jobs  mom.lock
--------

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Fri Dec 2 22:09:09 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Sat, 03 Dec 2011 13:09:09 +0800
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu>
Message-ID: <4ED9AEF5.8080709@yahoo.com.cn>

On 12/03/2011 12:57 AM, Gustavo Correa wrote:
> As Chris said, the jobs should continue running when you restart the pbs_server.
>
> Note that pbs_server runs only on your management node [node32, right?].
> Besides, you should have a scheduler running also in the management node.
> The scheduler could be the very simple pbs_sched [First in first out policy],
> or something more sophisticated like maui [open source] or moab [proprietary].
> On the other hand, pbs_mom runs on *all* compute nodes.
> If your management node is also used for computations, then it should also run pbs_mom.
> If it is not used for computations you don't need to run pbs_mom there.
> The compute nodes do *not* run the pbs_server or the scheduler.

Thanks a lot, it seems I should first read the manuals for Torque and
Maui to get a more detailed understanding of them.

>
> Your $TORQUE/server_name file should contain the name of your management node.

Yes, it contains only node32.

> Your $TORQUE/server_priv/nodes file should contain the list of nodes,
> the number of CPUs on each, and perhaps their properties [if they're different from each other].
> Something like:
> node01 np=8
> ...
>
> [Here $TORQUE is the directory where you installed Torque/PBS.]
>
> Some emails ago you seem to have said the nodes file was under acl_XXX, I am not sure,
> but check it out to make sure the nodes file is in the right location.

I only have one nodes file in the following location:

/opt/gridview/pbs/dispatcher/server_priv/nodes

And its contents are currently as follows:

------
node32:/opt/gridview/pbs/dispatcher/server_priv # cat /opt/gridview/pbs/dispatcher/server_priv/nodes
node1 np=8
node2 np=8
node3 np=8
node4 np=8
node5 np=8
node6 np=8
node7 np=8
node8 np=8
node9 np=8
node11 np=8
node10 np=3
node12 np=8
node13 np=8
node14 np=8
node15 np=8
node16 np=8
node17 np=8
node18 np=8
node19 np=8
node20 np=8
node21 np=8
node22 np=8
node23 np=8
node24 np=8
node25 np=8
node26 np=8
node27 np=8
node28 np=8
node29 np=8
node30 np=8
node33 np=16
node32:/opt/gridview/pbs/dispatcher/server_priv #
--------

>
> Torque and Maui have very good documentation, not hard to understand
> even for non-native English speakers [like you and me].
> It will save you a lot of time and headaches in the long run if you take some time now to read it:
>
> General URL, follow the links for Torque, Maui, etc:
> http://www.adaptivecomputing.com/resources/docs/

Great, thanks a lot.
I've downloaded them for reading ;-)

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From gus at ldeo.columbia.edu Sat Dec 3 06:13:47 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Sat, 3 Dec 2011 08:13:47 -0500
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4ED9A852.8050704@yahoo.com.cn>
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn> <163DADEE-53A0-4042-AB06-2B9D70F162F0@ldeo.columbia.edu> <4ED9A852.8050704@yahoo.com.cn>
Message-ID: <70329584-334F-4A2A-8CF8-13F6D8FF5F3F@ldeo.columbia.edu>

Hi Hongsheng

Answer below.

On Dec 2, 2011, at 11:40 PM, Hongsheng Zhao wrote:

> On 12/03/2011 01:23 AM, Gustavo Correa wrote:
>> Actually the mom configuration file is named 'config' only,
>> and is located at $TORQUE/mom_priv/config on all compute nodes.
>> I hope this helps.
>
> I cann't find the above file but I can find the folder named as mom_priv
> for my case. See the following for detail:
>

You can create the $TORQUE/mom_priv/config file on all nodes.
Most likely, to fit your needs, they should all have the same contents.
Something like this:

$pbsserver node32        [I am guessing this is your management node, where pbs_server runs]
$usecp *:/home /home     [See Chris Samuel's email. One such line per NFS-mounted directory]

By all means, read the Torque Administrator Guide in the link I sent you some emails ago.
All these details are clearly explained there.

I hope this helps.
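
[A minimal sketch of how the two-line config above could be written on each compute node. It assumes that $TORQUE is /opt/gridview/pbs/dispatcher, as in Hongsheng's listing, and that /home is the only NFS-mounted directory; the function name write_mom_config is made up for illustration and is not part of Torque.]

```shell
#!/bin/sh
# Hypothetical helper (not a Torque command): write a minimal pbs_mom config.
#   $1 = Torque home directory (assumed to be the one that contains mom_priv)
#   $2 = host name of the machine running pbs_server
write_mom_config() {
    torque_home=$1
    server=$2
    mkdir -p "$torque_home/mom_priv" || return 1
    # Single quotes keep the shell from expanding $pbsserver and $usecp:
    # they are pbs_mom directives, not shell variables.
    {
        printf '%s %s\n' '$pbsserver' "$server"
        printf '%s %s\n' '$usecp' '*:/home /home'
    } > "$torque_home/mom_priv/config"
}

# Example (assumed paths; run as root on each compute node):
# write_mom_config /opt/gridview/pbs/dispatcher node32
```

[pbs_mom most likely has to be restarted on the node before it picks up a new config file.]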

Gus Correa

From gus at ldeo.columbia.edu Sat Dec 3 06:18:08 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Sat, 3 Dec 2011 08:18:08 -0500
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4ED9A45B.3050201@yahoo.com.cn>
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn> <4ED88B08.1080807@yahoo.com.cn> <4ED9A45B.3050201@yahoo.com.cn>
Message-ID:

Hi Hongsheng

On Dec 2, 2011, at 11:23 PM, Hongsheng Zhao wrote:

> On 12/03/2011 03:46 AM, Kevin Van Workum wrote:
>> Can you ssh from node29 to node32?
>
> Good hints. I can ping node29 and ssh to node29 from node32:

Actually, for Torque to run right all compute nodes must be able to connect to the
management node [node32 in your case].
However, to run parallel MPI programs you most likely need to be able to ssh
across *any pair* of nodes without password.
Can you do this across any pair of nodes?

Gus Correa

From gus at ldeo.columbia.edu Sat Dec 3 06:48:23 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Sat, 3 Dec 2011 08:48:23 -0500
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <4ED9AEF5.8080709@yahoo.com.cn>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn>
Message-ID: <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu>

Hi Hongsheng

Answers below

On Dec 3, 2011, at 12:09 AM, Hongsheng Zhao wrote:

> On 12/03/2011 12:57 AM, Gustavo Correa wrote:
>> As Chris said, the jobs should continue running when you restart the pbs_server.
>>
>> Note that pbs_server runs only on your management node [node32, right?].
>> Besides, you should have a scheduler running also in the management node.
>> The scheduler could be the very simple pbs_sched [First in first out policy],
>> or something more sophisticated like maui [open source] or moab [proprietary].
>> On the other hand, pbs_mom runs on *all* compute nodes.
>> If your management node is also used for computations, then it should also run pbs_mom.
>> If it is not used for computations you don't need to run pbs_mom there.
>> The compute nodes do *not* run the pbs_server or the scheduler.
>
> Thanks a lot, it seems I should first read the manual for Torque and
> Maui to have some detailed understanding of them.

It will help.
Read the sections that explain how to do a quick Torque setup.
They are short and will give you a good overview.
Maybe read first Chapter 1, sections 2.1-2.3, 2.5, 3.1-3.3, 4.1, Chapter 5, sections 6.1-6.2,
Chapter 7, Chapter 9.
You can leave the Maui documents for later; just keep the standard maui.cfg
configuration file for now.

>
>>
>> Your $TORQUE/server_name file should contain the name of your management node.
>
> Yes, it has the node32 in it only.

That's good.

>
>> Your $TORQUE/server_priv/nodes file should contain the list of nodes,
>> the number of CPUs on each, and perhaps their properties [if they're different from each other].
>> Something like:
>> node01 np=8
>> ...
>>
>> [Here $TORQUE is the directory where you installed Torque/PBS.]
>>
>> Some emails ago you seem to have said the nodes file was under acl_XXX, I am not sure,
>> but check it out to make sure the nodes file is in the right location.
>
> I only have one nodes file in the following location:
>
> /opt/gridview/pbs/dispatcher/server_priv/nodes

Right, and this is on node32, correct?

>
> And the contents is as follows currently:
>
> ------
> node32:/opt/gridview/pbs/dispatcher/server_priv # cat
> /opt/gridview/pbs/dispatcher/server_priv/nodes
> node1 np=8
> node2 np=8
> node3 np=8
> node4 np=8
> node5 np=8
> node6 np=8
> node7 np=8
> node8 np=8
> node9 np=8
> node11 np=8
> node10 np=3
> node12 np=8
> node13 np=8
> node14 np=8
> node15 np=8
> node16 np=8
> node17 np=8
> node18 np=8
> node19 np=8
> node20 np=8
> node21 np=8
> node22 np=8
> node23 np=8
> node24 np=8
> node25 np=8
> node26 np=8
> node27 np=8
> node28 np=8
> node29 np=8
> node30 np=8
> node33 np=16
> node32:/opt/gridview/pbs/dispatcher/server_priv #
> --------
>

Somehow you have the line:

node10 np=3

Is this a typo in your email?
A typo in your nodes file, perhaps?
Is node10 different from the others, with 3 cores instead of 8?
Should it perhaps be this?

node10 np=8

Likewise for node33, but my guess is that node33 is actually bigger than the other nodes, right?

Also, if you are not going to use node32 for computations, i.e. like a regular compute node,
you should remove it from the nodes file above.
On the other hand, if you want to use node32 also for computations, it should be part of the
nodes file.
In that case you need to run pbs_mom *also* on node32.
In any case, the syntax on the line for node32 doesn't look right [no np=8 or similar, and
the path name :/opt/gridview/pbs/dispatcher/server_priv probably doesn't belong there either].
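
[The np question above can be checked mechanically. Here is a minimal sketch, assuming the nodes file has one "name np=N" entry per line as in the listing; the function name check_nodes_np is made up for illustration. It lists entries whose np differs from the most common value, plus entries with no np= at all. If two np values are equally common, which one counts as "most common" is arbitrary.]

```shell
#!/bin/sh
# Hypothetical helper (not a Torque command): flag unusual np values.
#   $1 = path to the nodes file
check_nodes_np() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }      # skip blank lines and lines starting with #
        {
            np = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) np = substr($i, 4)
            name[++n] = $1
            val[n] = np
            if (np != "") count[np]++
        }
        END {
            # find the np value used by the largest number of nodes
            for (v in count)
                if (count[v] > best) { best = count[v]; common = v }
            for (i = 1; i <= n; i++)
                if (val[i] == "")
                    print name[i] ": no np= setting"
                else if (val[i] != common)
                    print name[i] ": np=" val[i] " (most nodes have np=" common ")"
        }
    ' "$1"
}

# Example (path taken from the listing above):
# check_nodes_np /opt/gridview/pbs/dispatcher/server_priv/nodes
```

[On the file quoted above this should single out node10 and node33, which is only a hint to look at them, not proof that either entry is wrong.]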
If you want to use node32 for computations, maybe you could use np=4 [node32 np=4], to
leave some cores available for administrative tasks, user login sessions, compilation, etc.

On node32, what is the output of 'pbsnodes'?
Does it list all nodes in your nodes file, with the correct np?

>>
>> Torque and Maui have very good documentation, not hard to understand
>> even for non-native English speakers [like you and me].
>> It will save you a lot of time and headaches in the long run if you take some time now to read it:
>>
>> General URL, follow the links for Torque, Maui, etc:
>> http://www.adaptivecomputing.com/resources/docs/
>
> Great, thanks a lot. I've downloaded them for reading ;-)

You are welcome.
The documentation is very good.
Soon you will be able to help out other Torque users here on the list!

Good luck,
Gus Correa

From fotis at cern.ch Sat Dec 3 08:47:40 2011
From: fotis at cern.ch (Fotis Georgatos)
Date: Sat, 3 Dec 2011 17:47:40 +0200
Subject: [torqueusers] pbsnodes still show node state=free with all np assigned
In-Reply-To: <4ED9D4A5.7050202@cern.ch>
References: <4ED9D4A5.7050202@cern.ch>
Message-ID: <4EDA449C.3020101@cern.ch>

Hi Shaomin et al,

I believe that situation arises when pbs_mom crashes/restarts.
It helps to ssh to the node and inspect processes/logs.

It is common, for instance, to see torque/maui having stale resource
reservations on a node which has rebooted not long ago!
Somehow, the system fails to catch up with the true client status.
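
[A minimal sketch for spotting the nodes in question before logging in to them. It assumes that pbsnodes -a prints each node name at the start of a line, followed by indented "state = ...", "np = ..." and "jobs = ..." lines, with the jobs listed as comma-plus-space separated slot/jobid entries; the function name find_full_but_free is made up for illustration.]

```shell
#!/bin/sh
# Hypothetical helper (not a Torque command): read `pbsnodes -a` style output
# on stdin and print nodes that report state = free although every np slot
# already has a job assigned to it.
find_full_but_free() {
    awk '
        function report() {
            if (node != "" && state == "free" && np > 0 && njobs >= np)
                print node ": state=free but " njobs "/" np " slots in use"
        }
        # a line that starts in column one begins a new node record
        /^[^ \t]/ { report(); node = $1; state = ""; np = 0; njobs = 0; next }
        $1 == "state" && $2 == "=" { state = $3 }
        $1 == "np"    && $2 == "=" { np = $3 + 0 }
        # assumed layout: "jobs = 0/1622.server, 1/1622.server, ..."
        $1 == "jobs"  && $2 == "=" { njobs = NF - 2 }
        END { report() }
    '
}

# Example:
# pbsnodes -a | find_full_but_free
```

[Any node it prints is a candidate for the ssh-and-inspect step above.]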
I recommend you also send us the output (or a screenshot) of this tool:
http://fotis.web.cern.ch/fotis/QTOP/ # on big sites try: qtop|less -RS

qtop was written precisely because of situations like this.

Packages for multiple distributions are available from here:
http://download.opensuse.org/repositories/home:/georgatos/
(redhat, suse, debian, ubuntu... ; that build service is just great)

best,
Fotis

-------- Original Message --------
Subject: torqueusers Digest, Vol 89, Issue 4
Date: Fri, 2 Dec 2011 09:39:34 -0700

Today's Topics:

1. pbsnodes still show node state=free with all np assigned (Shaomin Hu)

----------------------------------------------------------------------

Message: 1
Date: Thu, 1 Dec 2011 14:43:34 -0500
From: Shaomin Hu
Subject: [torqueusers] pbsnodes still show node state=free with all np assigned
To: torqueusers at supercluster.org

We are running Torque v3.0.2 with the Maui scheduler. We have a 5-node job running on nodes carter-a631, a630, a629, a628 and a615. All 16 cores on each of these nodes are assigned to this job. Nodes a631, a629, a628 and a615 all show state job-exclusive, but node carter-a630 still shows state=free.

[root at carter-adm accounting]# qstat -a -n1

carter-adm.rcac.purdue.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1051.carter-adm.
mluisier workq P6400 -- 400 640 -- 02:00 Q -- --
1622.carter-adm. knagara workq STDIN 6957 5 80 -- 04:00 R 02:57 carter-a631/15+carter-a631/14+carter-a631/13+carter-a631/12+carter-a631/11+carter-a631/10+carter-a631/9+carter-a631/8+carter-a631/7+carter-a631/6+carter-a631/5+carter-a631/4+carter-a631/3+carter-a631/2+carter-a631/1+carter-a631/0+carter-a630/15+carter-a630/14+carter-a630/13+carter-a630/12+carter-a630/11+carter-a630/10+carter-a630/9+carter-a630/8+carter-a630/7+carter-a630/6+carter-a630/5+carter-a630/4+carter-a630/3+carter-a630/2+carter-a630/1+carter-a630/0+carter-a629/15+carter-a629/14+carter-a629/13+carter-a629/12+carter-a629/11+carter-a629/10+carter-a629/9+carter-a629/8+carter-a629/7+carter-a629/6+carter-a629/5+carter-a629/4+carter-a629/3+carter-a629/2+carter-a629/1+carter-a629/0+carter-a628/15+carter-a628/14+carter-a628/13+carter-a628/12+carter-a628/11+carter-a628/10+carter-a628/9+carter-a628/8+carter-a628/7+carter-a628/6+carter-a628/5+carter-a628/4+carter-a628/3+carter-a628/2+carter-a628/1+carter-a628/0+carter-a615/15+carter-a615/14+carter-a615/13+carter-a615/12+carter-a615/11+carter-a615/10+carter-a615/9+carter-a615/8+carter-a615/7+carter-a615/6+carter-a615/5+carter-a615/4+carter-a615/3+carter-a615/2+carter-a615/1+carter-a615/0
1625.carter-adm.
hu8 workq submit.pbs -- 400 640 -- 04:00 Q -- -- [root at carter-adm accounting]# qstat -f 1622 Job Id: 1622.carter-adm.rcac.purdue.edu Job_Name = STDIN Job_Owner = knagara at carter-fe00.rcac.purdue.edu resources_used.cput = 19:37:42 resources_used.mem = 90248520kb resources_used.vmem = 116310764kb resources_used.walltime = 02:57:56 job_state = R queue = workq server = carter-adm.rcac.purdue.edu Checkpoint = u ctime = Thu Dec 1 11:22:01 2011 Error_Path = /dev/pts/0 exec_host = carter-a631/15+carter-a631/14+carter-a631/13+carter-a631/12+ca rter-a631/11+carter-a631/10+carter-a631/9+carter-a631/8+carter-a631/7+ carter-a631/6+carter-a631/5+carter-a631/4+carter-a631/3+carter-a631/2+ carter-a631/1+carter-a631/0+carter-a630/15+carter-a630/14+carter-a630/ 13+carter-a630/12+carter-a630/11+carter-a630/10+carter-a630/9+carter-a 630/8+carter-a630/7+carter-a630/6+carter-a630/5+carter-a630/4+carter-a 630/3+carter-a630/2+carter-a630/1+carter-a630/0+carter-a629/15+carter- a629/14+carter-a629/13+carter-a629/12+carter-a629/11+carter-a629/10+ca rter-a629/9+carter-a629/8+carter-a629/7+carter-a629/6+carter-a629/5+ca rter-a629/4+carter-a629/3+carter-a629/2+carter-a629/1+carter-a629/0+ca rter-a628/15+carter-a628/14+carter-a628/13+carter-a628/12+carter-a628/ 11+carter-a628/10+carter-a628/9+carter-a628/8+carter-a628/7+carter-a62 8/6+carter-a628/5+carter-a628/4+carter-a628/3+carter-a628/2+carter-a62 8/1+carter-a628/0+carter-a615/15+carter-a615/14+carter-a615/13+carter- a615/12+carter-a615/11+carter-a615/10+carter-a615/9+carter-a615/8+cart er-a615/7+carter-a615/6+carter-a615/5+carter-a615/4+carter-a615/3+cart er-a615/2+carter-a615/1+carter-a615/0 exec_port = 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15 003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+ 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+1500 3+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15 
003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+ 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+1500 3+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003 Hold_Types = n interactive = True Join_Path = n Keep_Files = n Mail_Points = a mtime = Thu Dec 1 11:23:10 2011 Output_Path = /dev/pts/0 Priority = 0 qtime = Thu Dec 1 11:22:01 2011 Rerunable = False Resource_List.neednodes = 5:ppn=16 Resource_List.nodect = 5 Resource_List.nodes = 5:ppn=16 Resource_List.walltime = 04:00:00 session_id = 6957 substate = 42 Variable_List = PBS_O_QUEUE=workq,PBS_O_HOME=/home/ba01/u111/knagara, PBS_O_LANG=C,PBS_O_LOGNAME=knagara, PBS_O_PATH=/usr/lib64/qt-3.3/bin:/opt/platform_mpi/bin:/usr/local/bin :/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/clustertest/bin:/o pt/cuda/bin:/opt/cuda/C/bin/linux/release:/opt/hpss/bin:/opt/hsi/bin:/ opt/bin:/usr/pbs/bin,PBS_O_MAIL=/var/spool/mail/knagara, PBS_O_SHELL=/usr/local/bin/bash, PBS_O_HOST=carter-fe00.rcac.purdue.edu, PBS_SERVER=carter-adm.rcac.purdue.edu, PBS_O_WORKDIR=/home/ba01/u111/knagara euser = knagara egroup = itap hashname = 1622.carter-adm.rcac.purdue.edu queue_rank = 109 queue_type = E etime = Thu Dec 1 11:22:01 2011 submit_args = -l nodes=5:ppn=16 -I start_time = Thu Dec 1 11:22:14 2011 Walltime.Remaining = 3706 start_count = 1 fault_tolerant = False submit_host = carter-fe00.rcac.purdue.edu init_work_dir = /home/ba01/u111/knagara [root at carter-adm accounting]# pbsnodes carter-a615 carter-a615 state = job-exclusive np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 
1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767209,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2960692084,gres=,loadave=0.00,ncpus=16,physmem=32841344kb,availmem=48544800kb,totmem=49618552kb,idletime=97455,nusers=1,nsessions=1,sessions=10669,uname=Linux carter-a615.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at carter-adm accounting]# pbsnodes carter-a628 carter-a628 state = job-exclusive np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767252,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2959716791,gres=,loadave=0.31,ncpus=16,physmem=32841344kb,availmem=48523232kb,totmem=49618552kb,idletime=97387,nusers=0,nsessions=0,uname=Linux carter-a628.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at carter-adm accounting]# pbsnodes carter-a629 carter-a629 state = job-exclusive np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 
1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767259,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2958375729,gres=,loadave=0.00,ncpus=16,physmem=32841344kb,availmem=48550744kb,totmem=49618552kb,idletime=97396,nusers=0,nsessions=0,uname=Linux carter-a629.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at carter-adm accounting]# pbsnodes carter-a630 carter-a630 state = free np = 16 properties = carter ntype = cluster jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/ 1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/ 1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/ 1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/ 1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/ 1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/ 1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/ 1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/ 1622.carter-adm.rcac.purdue.edu status = rectime=1322767263,varattr=,jobs= 1622.carter-adm.rcac.purdue.edu ,state=free,netload=2959950109,gres=,loadave=0.01,ncpus=16,physmem=32841344kb,availmem=48526672kb,totmem=49618552kb,idletime=97399,nusers=0,nsessions=0,uname=Linux carter-a630.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 [root at 
carter-adm accounting]# pbsnodes carter-a631
carter-a631
     state = job-exclusive
     np = 16
     properties = carter
     ntype = cluster
     jobs = 0/1622.carter-adm.rcac.purdue.edu, 1/1622.carter-adm.rcac.purdue.edu, 2/1622.carter-adm.rcac.purdue.edu, 3/1622.carter-adm.rcac.purdue.edu, 4/1622.carter-adm.rcac.purdue.edu, 5/1622.carter-adm.rcac.purdue.edu, 6/1622.carter-adm.rcac.purdue.edu, 7/1622.carter-adm.rcac.purdue.edu, 8/1622.carter-adm.rcac.purdue.edu, 9/1622.carter-adm.rcac.purdue.edu, 10/1622.carter-adm.rcac.purdue.edu, 11/1622.carter-adm.rcac.purdue.edu, 12/1622.carter-adm.rcac.purdue.edu, 13/1622.carter-adm.rcac.purdue.edu, 14/1622.carter-adm.rcac.purdue.edu, 15/1622.carter-adm.rcac.purdue.edu
     status = rectime=1322767255,varattr=,jobs=1622.carter-adm.rcac.purdue.edu,state=free,netload=2883619488,gres=,loadave=0.00,ncpus=16,physmem=32841344kb,availmem=48547336kb,totmem=49618552kb,idletime=97387,nusers=1,nsessions=1,sessions=6957,uname=Linux carter-a631.rcac.purdue.edu 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

[root at carter-adm accounting]#

The node definitions are as follows:

carter-a614 np=16 carter
carter-a615 np=16 carter
carter-a616 np=16 carter
carter-a617 np=16 carter
carter-a618 np=16 carter
carter-a619 np=16 carter
carter-a620 np=16 carter
carter-a621 np=16 carter
carter-a622 np=16 carter
carter-a623 np=16 carter
carter-a624 np=16 carter
carter-a625 np=16 carter
carter-a626 np=16 carter
carter-a627 np=16 carter
carter-a628 np=16 carter
carter-a629 np=16 carter
carter-a630 np=16 carter
carter-a631 np=16 carter
carter-a632 np=16 carter

Has anyone else seen a similar issue?

Thanks,
Shaomin

------------------------------

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

End of torqueusers Digest, Vol 89, Issue 4
******************************************

From samuel at unimelb.edu.au Sun Dec 4 16:45:06 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 05 Dec 2011 10:45:06 +1100
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4ED8811E.6060706@yahoo.com.cn>
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn>
Message-ID: <4EDC0602.8030008@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hiya!

On 02/12/11 18:41, Hongsheng Zhao wrote:

> I've tried to find the pbs_mom.conf on the management node of my cluster
> but failed. Any hints?

There may not be one by default; we start our pbs_moms with:

/usr/local/torque/latest/sbin/pbs_mom -p -c /usr/local/etc/pbs_mom.conf

The -c option tells it where to find the config file and the -p option tells it to inherit already running jobs, instead of killing them off. Our /usr/local is shared across the system, so it's all kept nicely in sync.

> Thanks a lot for the quick and helpful reply ;-)

Pleasure! Sorry for the delay in this reply, it was the weekend here and I had a lot on at home..
Best of luck,
Chris
- --
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7cBgIACgkQO2KABBYQAh/ACACfaopXqoN680VfMiyuJYpUX/KY
TFIAniOLmLC6wFnEpBwdVktyvQQ4+NhN
=xZIo
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Sun Dec 4 16:52:31 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 05 Dec 2011 10:52:31 +1100
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To:
References: <4ED86182.2000202@yahoo.com.cn> <4ED866C5.7020803@unimelb.edu.au> <4ED8811E.6060706@yahoo.com.cn> <4ED88B08.1080807@yahoo.com.cn> <4ED9A45B.3050201@yahoo.com.cn>
Message-ID: <4EDC07BF.3060306@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/12/11 00:18, Gustavo Correa wrote:

> However, to run parallel MPI programs you most likely
> need to be able to ssh across *any pair* of nodes
> without password.

I'd suggest two more preferable options (IMHO):

1) Use Open-MPI and build it with the --with-tm option so it has native Torque support built in.

2) If you are stuck with an MPI that doesn't support Torque (like Intel's) then build the OSC mpiexec program here:

http://www.osc.edu/~djohnson/mpiexec/index.php

The major benefits of either of these approaches are:

a) It automatically finds out on which nodes to run and how many cores on each it has been assigned.

b) No need to rsh/ssh between nodes.

c) If you are using Torque's cpusets support then it will end up in the cpuset where it is meant to be, so you won't get stomped on by other users' jobs.

cheers!
Chris
- --
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7cB78ACgkQO2KABBYQAh8DJQCeP7CDTJvriHHvFrT4hEDPFRgu
WpEAnRGjckA+NghEjqOjyqBwDccdCPmE
=daXG
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Sun Dec 4 17:15:47 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 05 Dec 2011 11:15:47 +1100
Subject: [torqueusers] pbsnodes still show node state=free with all np assigned
In-Reply-To: <4EDA449C.3020101@cern.ch>
References: <4ED9D4A5.7050202@cern.ch> <4EDA449C.3020101@cern.ch>
Message-ID: <4EDC0D33.8090104@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/12/11 02:47, Fotis Georgatos wrote:

> It is common, for instance, to see torque/maui having stale resource
> reservations on a node which has rebooted not long ago!

Do you have:

set server mom_job_sync = True

in your pbs_server's config?
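One way to answer that question is to dump the server configuration with qmgr -c 'print server' and look for the attribute. The sketch below wraps the check in a small function (the name has_mom_job_sync is made up here) that reads such a dump on standard input, so it can be tried without a live server. It assumes the "set server name = value" line format that qmgr prints; if the attribute is absent from the dump, that may simply mean the built-in default is in effect, so the documentation for the installed Torque version should be consulted before drawing conclusions.

```shell
#!/bin/sh
# has_mom_job_sync: hypothetical helper, not a Torque command.
# Reads the output of "qmgr -c 'print server'" on stdin and succeeds
# (exit status 0) only if it contains "set server mom_job_sync = True".
has_mom_job_sync() {
    grep -Eiq '^[[:space:]]*set[[:space:]]+server[[:space:]]+mom_job_sync[[:space:]]*=[[:space:]]*true[[:space:]]*$'
}

# Typical use on the pbs_server host (commented out: needs a running server):
#   qmgr -c 'print server' | has_mom_job_sync \
#       || qmgr -c 'set server mom_job_sync = True'
```

Reading from stdin keeps the check separate from the qmgr call, which makes it easy to test against a saved configuration dump.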
- --
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7cDTIACgkQO2KABBYQAh/erACcCxI3QXL7vA76ShpjytvJe02z
zAIAnAt8rH+6FwML3WyM4auFqlM/90g6
=lEDs
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Sun Dec 4 21:31:28 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 05 Dec 2011 15:31:28 +1100
Subject: [torqueusers] specific nodes
In-Reply-To:
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> <1BAB6393-963A-4A58-8223-318FAFC5AF99@ldeo.columbia.edu>
Message-ID: <4EDC4920.2080006@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03/12/11 09:09, Ricardo Román Brenes wrote:

> yeah pal! I just read that somewhere else! I'm going to
> upgrade my MPICH and then try to compile OSC mpiexec =)

Better to just use Open-MPI instead and build it with Torque support via the TM API.

If it's good enough for the 'K' machine (current #1 on the Top500 and the first 10PF machine) then it should be OK for you too.. ;-)

cheers!
Chris
- --
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7cSSAACgkQO2KABBYQAh+o0wCeNozpKEFqHD1kBM0/vSDwItXZ
tZ0AnjTt64dyxCZ/5IRm7Pnhj6AubZET
=8dNg
-----END PGP SIGNATURE-----

From chemprof89 at gmail.com Sat Dec 3 15:46:15 2011
From: chemprof89 at gmail.com (Clarke Earley)
Date: Sat, 03 Dec 2011 17:46:15 -0500
Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
Message-ID: <4EDAA6B7.1040106@gmail.com>

I am in the process of setting up Torque and Maui on a 2-node cluster running under Debian. Compilation and installation ran without any problems, and submission of simple test jobs ( $ echo "sleep 30" | qsub ) also ran without any issue. However, when I try to specify multiple nodes, the jobs fail as follows.
> $ echo "sleep 30" | qsub -l nodes=2:npp=2 -q batch
> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodes requirement
> $ echo "sleep 30" | qsub -l nodes=1:npp=2 -q batch
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy)

The server_priv/nodes file exists on the master node (thebrain):

> $ cat /var/spool/torque/server_priv/nodes
> thebrain np=12
> yakko np=12

and appears to be recognized by pbsnodes:

> $ pbsnodes
> thebrain
> state = free
> np = 12
> ntype = cluster
> status = rectime=1322943942,varattr=,jobs=,state=free,netload=905815039,gres=,loadave=0.00,ncpus=24,physmem=33004284kb,availmem=97029336kb,totmem=97457404kb,idletime=1692,nusers=2,nsessions=11,sessions=19903 24143 24149 24150 24151 24152 24896 24902 24903 24904 24905,uname=Linux chem 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 x86_64,opsys=linux
> gpus = 0
> yakko
> state = free
> np = 12
> ntype = cluster
> status = rectime=1322943940,varattr=,jobs=,state=free,netload=242901050,gres=,loadave=0.00,ncpus=24,physmem=33004284kb,availmem=56432072kb,totmem=56736504kb,idletime=3087,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux yakko 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 x86_64,opsys=linux
> gpus = 0

The output of qstat appears to indicate that resources are available:

> $ qstat -Qf
> Queue: batch
> queue_type = Execution
> total_jobs = 0
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
> resources_max.ncpus = 4
> resources_max.nodes = 2
> resources_max.procct = 24
> resources_default.nodes = 1
> resources_default.walltime = 01:00:00
> mtime = 1322940014
> resources_available.ncpus = 4
> resources_available.nodes = 2
> resources_available.procct = 24
> resources_assigned.nodect = 0
> enabled = True
> started = True

I did not see anything in the log files that appeared helpful. Any suggestions would be most appreciated. Thank you in advance for your help.
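The two failing requests above use npp=2, whereas the per-node processor keyword that appears elsewhere in this archive is ppn (for example -l nodes=5:ppn=16). Because such slips are easy to make, a resource string can be sanity-checked before it reaches qsub. This is a minimal sketch with a made-up helper name (check_nodes_spec): it only knows the keywords ppn and gpus and treats any other key=value token as suspicious, which is an assumption for illustration and not a full parser for Torque's node specification syntax.

```shell
#!/bin/sh
# check_nodes_spec SPEC
# Hypothetical pre-submission check (not part of Torque): prints a warning
# for every key=value token in a nodes spec such as "2:ppn=2+1:ppn=4"
# whose key is not ppn or gpus.
# Exit status: 0 if nothing suspicious was found, 1 otherwise.
check_nodes_spec() {
    printf '%s\n' "$1" | tr '+:' '\n\n' | awk -F= '
        NF == 2 && $1 != "ppn" && $1 != "gpus" {
            printf "suspicious token: %s (known keys: ppn, gpus)\n", $0
            bad = 1
        }
        END { exit bad }
    '
}
```

A possible use is to guard the submission, e.g. check_nodes_spec "2:ppn=2" && echo "sleep 30" | qsub -l nodes=2:ppn=2 -q batch, so that a mistyped keyword is reported locally instead of producing a scheduler error.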
From jjc at iastate.edu Mon Dec 5 13:02:34 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Mon, 5 Dec 2011 20:02:34 +0000
Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
In-Reply-To: <4EDAA6B7.1040106@gmail.com>
References: <4EDAA6B7.1040106@gmail.com>
Message-ID: <242421BFAF465844BE24EB90BB97E221017DDE2F@ITSDAG3D.its.iastate.edu>

I think that you have a typo. Try using ppn=2 rather than npp=2

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Clarke Earley
>Sent: Saturday, December 03, 2011 4:46 PM
>To: torqueusers at supercluster.org
>Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
>
>[snipped]
>
> > $ echo "sleep 30" | qsub -l nodes=2:npp=2 -q batch
> > qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodes requirement
> > $ echo "sleep 30" | qsub -l nodes=1:npp=2 -q batch
> > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy)
>
>[snipped]

From zhaohscas at yahoo.com.cn Mon Dec 5 22:38:31 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Tue, 06 Dec 2011 13:38:31 +0800
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu>
Message-ID: <4EDDAA57.4090308@yahoo.com.cn>

On 12/03/2011 09:48 PM, Gustavo Correa wrote:
> Hi Hongsheng
>
> Answers below
>
> On Dec 3, 2011, at 12:09 AM, Hongsheng Zhao wrote:
>
>> On 12/03/2011 12:57 AM, Gustavo Correa wrote:

[snipped]

>>
>>> Your $TORQUE/server_priv/nodes file should contain the list of nodes,
>>> the number of CPUs on each, and perhaps their properties [if they're different from each other].
>>> Something like:
>>> node01 np=8
>>> ...
>>>
>>> [Here $TORQUE is the directory where you installed Torque/PBS.]
>>>
>>> Some emails ago you seem to have said the nodes file was under acl_XXX, I am not sure,
>>> but check it out to make sure the nodes file is in the right location.
>>
>> I only have one nodes file in the following location:
>>
>> /opt/gridview/pbs/dispatcher/server_priv/nodes
>
> Right, and this is on node32, correct?

Yes, this file exists only on node32, which is the management node. PBS itself is installed on all of the compute nodes as well as the management node, i.e., the following directory exists on all of them:

/opt/gridview/pbs/

>
>>
>> And its contents are currently as follows:
>>
>> ------
>> node32:/opt/gridview/pbs/dispatcher/server_priv # cat /opt/gridview/pbs/dispatcher/server_priv/nodes
>> node1 np=8
>> node2 np=8
>> node3 np=8
>> node4 np=8
>> node5 np=8
>> node6 np=8
>> node7 np=8
>> node8 np=8
>> node9 np=8
>> node11 np=8
>> node10 np=3
>> node12 np=8
>> node13 np=8
>> node14 np=8
>> node15 np=8
>> node16 np=8
>> node17 np=8
>> node18 np=8
>> node19 np=8
>> node20 np=8
>> node21 np=8
>> node22 np=8
>> node23 np=8
>> node24 np=8
>> node25 np=8
>> node26 np=8
>> node27 np=8
>> node28 np=8
>> node29 np=8
>> node30 np=8
>> node33 np=16
>> node32:/opt/gridview/pbs/dispatcher/server_priv #
>> --------
>>
>
> Somehow you have the line:
> node10 np=3
> Is this a typo in your email?
> A typo in your nodes file, perhaps?
> Is node10 different from the others, with 3 cores instead of 8?
> Should it perhaps be this?
> node10 np=8
> Likewise for node33, but my guess is that node33 is actually bigger than the other nodes, right?

To be frank, this file was supplied by the vendor. How can I determine the correct value of np? Can I work it out from the output of /proc/cpuinfo on a given node?
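As a rough sketch of how the information in /proc/cpuinfo can be boiled down: the function below (the name cpu_summary is invented for this example) counts logical processors, distinct "physical id" values (sockets) and distinct "physical id"/"core id" pairs (physical cores). It assumes the usual x86 Linux field names. Whether np should then match the logical count or the physical core count is a site policy decision, not something the script can decide.

```shell
#!/bin/sh
# cpu_summary [FILE]
# Hypothetical helper: summarises a cpuinfo-format file
# (FILE defaults to /proc/cpuinfo) as logical CPUs, sockets and cores.
cpu_summary() {
    awk -F'[ \t]*:[ \t]*' '
        $1 == "processor"   { logical++ }                 # one entry per logical CPU
        $1 == "physical id" { pkg = $2; sockets[pkg] = 1 } # remember the socket
        $1 == "core id"     { cores[pkg "/" $2] = 1 }      # unique socket/core pair
        END {
            ns = 0; for (s in sockets) ns++
            nc = 0; for (c in cores) nc++
            printf "logical=%d sockets=%d cores=%d\n", logical, ns, nc
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Running cpu_summary with no argument on a compute node prints a single line. On a machine with two quad-core sockets and Hyper-Threading enabled, this would be expected to report logical=16 sockets=2 cores=8, in which case np=8 would give one job slot per physical core and np=16 one per hardware thread.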
For example, on node33 I get the following information:

------------
node33:~ # cat /proc/cpuinfo

processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4529.08 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:

processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.32 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:

processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 2 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4517.39 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:

processor : 3 vendor_id :
GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.33 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 1 siblings : 8 core id : 0 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.31 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 1 siblings : 8 core id : 1 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.35 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache 
size : 8192 KB physical id : 1 siblings : 8 core id : 2 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.32 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 1 siblings : 8 core id : 3 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 8 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4524.65 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 9 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes 
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.32 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 10 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 2 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.44 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 11 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.29 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 12 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 1 siblings : 8 core id : 0 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm 
syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.39 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 13 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 1 siblings : 8 core id : 1 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4523.18 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 14 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 1 siblings : 8 core id : 2 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.28 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 15 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 cpu MHz : 1600.000 cache size : 8192 KB physical id : 1 siblings : 8 core id : 3 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm bogomips : 4522.37 clflush size : 
64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:
node33:~ #
-------------

>
> Also, if you are not going to use node32 for computations, i.e. like a regular compute
> node, you should remove it from the nodes file above.
> On the other hand, if you want to use node32 also for computations,
> it should be part of the nodes file.
> In that case you need to run pbs_mom *also* on node32.
> In any case, the syntax on the line for node32 doesn't look right [no np=8 or similar, and the
> path name :/opt/gridview/pbs/dispatcher/server_priv probably doesn't belong there either].
> If you want to use node32 for computations, maybe you could use np=4 [node32 np=4],
> to leave some cores available for administrative tasks, user login sessions, compilation, etc.

Thanks for your kind analysis, and sorry for the confusion: the earlier output was garbled by the terminal's carriage returns and line feeds. The actual lines of the nodes file are as follows:

-----
node1 np=8
node2 np=8
node3 np=8
node4 np=8
node5 np=8
node6 np=8
node7 np=8
node8 np=8
node9 np=8
node11 np=8
node10 np=3
node12 np=8
node13 np=8
node14 np=8
node15 np=8
node16 np=8
node17 np=8
node18 np=8
node19 np=8
node20 np=8
node21 np=8
node22 np=8
node23 np=8
node24 np=8
node25 np=8
node26 np=8
node27 np=8
node28 np=8
node29 np=8
node30 np=8
node33 np=16
-------

>
> On node32, what is the output of 'pbsnodes' ?
> Does it list all nodes in your nodes file, with the correct np?
We don't use this node for computations, as the nodes file above shows ;-)

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From Gareth.Williams at csiro.au Mon Dec 5 23:10:53 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Tue, 6 Dec 2011 17:10:53 +1100
Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
In-Reply-To: <242421BFAF465844BE24EB90BB97E221017DDE2F@ITSDAG3D.its.iastate.edu>
References: <4EDAA6B7.1040106@gmail.com> <242421BFAF465844BE24EB90BB97E221017DDE2F@ITSDAG3D.its.iastate.edu>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102C6360AC8@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Coyle, James J [ITACD] [mailto:jjc at iastate.edu]
> Sent: Tuesday, 6 December 2011 7:03 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
>
> I think that you have a typo.
>
> Try using ppn=2 rather than npp=2

You are also getting ncpus and procct set from the default_max numbers. This might be OK but might be problematic. I'd avoid ncpus, but procct is probably OK as I think it gets stripped from the job as it is started anyway.

All: is this a reasonable MSG? Would it be hard to make the feedback more direct in this case? Is npp=2 in this context clearly an error, or could it be meaningful in a 'real' cluster configuration?

Gareth

>
> >-----Original Message-----
> >From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Clarke Earley
> >Sent: Saturday, December 03, 2011 4:46 PM
> >To: torqueusers at supercluster.org
> >Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
> >
> >I am in the process of setting up Torque and Maui on a 2 node cluster running under Debian. Compilation and installation ran without any problems, and submission of simple test jobs ( $ echo "sleep 30" | qsub ) also ran without any issue. However, when I try to specify multiple nodes, the jobs fail as follows.
> >
> > > $ echo "sleep 30" | qsub -l nodes=2:npp=2 -q batch
> > > qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodes requirement
> > > $ echo "sleep 30" | qsub -l nodes=1:npp=2 -q batch
> > > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy)
> >
> >The server_priv/nodes file exists on the master node (thebrain):
> > > $ cat /var/spool/torque/server_priv/nodes
> > > thebrain np=12
> > > yakko np=12
> >
> >and appears to be recognized by pbsnodes
> > > $ pbsnodes
> > > thebrain
> > >     state = free
> > >     np = 12
> > >     ntype = cluster
> > >     status = rectime=1322943942,varattr=,jobs=,state=free,netload=905815039,gres=,loadave=0.00,ncpus=24,physmem=33004284kb,availmem=97029336kb,totmem=97457404kb,idletime=1692,nusers=2,nsessions=11,sessions=19903 24143 24149 24150 24151 24152 24896 24902 24903 24904 24905,uname=Linux chem 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 x86_64,opsys=linux
> > >     gpus = 0
> > >
> > > yakko
> > >     state = free
> > >     np = 12
> > >     ntype = cluster
> > >     status = rectime=1322943940,varattr=,jobs=,state=free,netload=242901050,gres=,loadave=0.00,ncpus=24,physmem=33004284kb,availmem=56432072kb,totmem=56736504kb,idletime=3087,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux yakko 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 x86_64,opsys=linux
> > >     gpus = 0
> >
> >The output of qstat appears to indicate that resources are available:
> > > $ qstat -Qf
> > > Queue: batch
> > >     queue_type = Execution
> > >     total_jobs = 0
> > >     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
> > >     resources_max.ncpus = 4
> > >     resources_max.nodes = 2
> > >     resources_max.procct = 24
> > >     resources_default.nodes = 1
> > >     resources_default.walltime = 01:00:00
> > >     mtime = 1322940014
> > >     resources_available.ncpus = 4
> > >     resources_available.nodes = 2
> > >     resources_available.procct = 24
> > >     resources_assigned.nodect = 0
> > >     enabled = True
> > >     started = True
> >
> >I did not see anything in the log files that appeared helpful. Any suggestions would be most appreciated. Thank you in advance for your help.

From zhaohscas at yahoo.com.cn Tue Dec 6 01:05:56 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Tue, 06 Dec 2011 16:05:56 +0800
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <4EDDAA57.4090308@yahoo.com.cn>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn>
Message-ID: <4EDDCCE4.8080602@yahoo.com.cn>

On 12/06/2011 01:38 PM, Hongsheng Zhao wrote:
> To be frank, this file was provided by the vendor. I want to know how to
> determine the value of np accurately. Can I determine it based on the
> output of /proc/cpuinfo for a specific node?
> Say, on node33, I can obtain the following information:

It seems the following command gives me the value of np for a specific node, for example node33:

zhaohongsheng at node33:~> grep processor /proc/cpuinfo | wc -l
16

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Tue Dec 6 01:44:34 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Tue, 06 Dec 2011 16:44:34 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu>
References: <4ED86182.2000202@yahoo.com.cn> <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu>
Message-ID: <4EDDD5F2.7010507@yahoo.com.cn>

On 12/03/2011 12:40 AM, Gustavo Correa wrote:
> Stdout and stderr by default stay in the first node on your list,
> until the job ends. At that point they're transferred to the work directory.
> Follow Christopher Samuel's recommendation for the pbs_mom.conf files [on all nodes!]
> I hope this helps,
> Gus Correa

Thanks a lot for all the help here!

I finally found that the issue was caused by incorrect entries for node32 in the /etc/hosts file on my cluster. Details follow.

Originally, the lines relevant to node32 in the /etc/hosts file were:

---------
192.168.1.32 node32.nxu.edu.cn
10.10.10.32 ibnode32
10.10.10.32 node32.nxu.edu.cn
202.201.128.36 node32.nxu.edu.cn node32
----------

In my case the 10.10.10.x network is an InfiniBand network and 192.168.1.x is an Ethernet network, so each node has two sets of network settings. In addition, node32 is the management node, so it also needs a public IP address, 202.201.128.36, for accessing the cluster from the internet.
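
For hosts entries like the ones listed above, one quick way to spot a name that is mapped to several addresses is to scan the file for hostnames that appear with more than one IP. The following is only a sketch and not part of the original post: `dup_hostnames` is a made-up helper name, and it assumes a POSIX shell with awk and the usual "address name [alias...]" hosts layout.

```shell
# List hostnames that a hosts-format file maps to more than one address.
# Usage: dup_hostnames [FILE]   (FILE defaults to /etc/hosts)
# NOTE: dup_hostnames is a hypothetical helper, not a Torque or system tool.
dup_hostnames() {
    awk '
        { sub(/#.*/, "") }                 # drop trailing comments
        NF >= 2 {
            for (i = 2; i <= NF; i++) {
                if (!seen[$i, $1]++) {     # count each (name, address) pair once
                    count[$i]++
                    addrs[$i] = addrs[$i] " " $1
                }
            }
        }
        END {
            for (name in count)
                if (count[name] > 1)
                    print name ":" addrs[name]
        }
    ' "${1:-/etc/hosts}" | sort
}
```

Run against a file holding just the four original lines, this prints `node32.nxu.edu.cn: 192.168.1.32 10.10.10.32 202.201.128.36`. A name listed here is not necessarily wrong, but it is worth checking which of its addresses the PBS daemons are expected to use.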
I have now changed those lines to the following form:

--------
192.168.1.32 node32
10.10.10.32 ibnode32
202.201.128.36 node32.nxu.edu.cn node32
---------

This solved the issue I posted here.

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Tue Dec 6 03:51:52 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Tue, 06 Dec 2011 18:51:52 +0800
Subject: [torqueusers] The issue when runing dmol3.
Message-ID: <4EDDF3C8.4030103@yahoo.com.cn>

Hi all,

I run dmol3 from MaterialsStudio55 in standalone mode through the PBS queueing system. In the PBS script for my job, I use the following lines:

-------
# Set filenames for PBS to log standard error and standard output.
#PBS -o stdout
#PBS -e stderr
---------

After the job has finished, I find the following in the stderr file in the current directory:

-------------
zhaohongsheng at node32:~/work/Dr.Zhao/dmol3_test> cat stderr
cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk'cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk': No such file or directory : No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk': No such file or
directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file 
or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_'cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory : No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such 
file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_'cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory: No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_'cp: 
cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_'cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory : No such file or directory : No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat 
`/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_'cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory : No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory
zhaohongsheng at node32:~/work/Dr.Zhao/dmol3_test>
--------------

However, the job seems to have completed successfully. I can't figure out why this happens. Any hints will be highly appreciated.

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From rsvancara at wsu.edu Tue Dec 6 04:27:54 2011
From: rsvancara at wsu.edu (Svancara, Randall)
Date: Tue, 6 Dec 2011 11:27:54 +0000
Subject: [torqueusers] The issue when runing dmol3.
In-Reply-To: <4EDDF3C8.4030103@yahoo.com.cn>
References: <4EDDF3C8.4030103@yahoo.com.cn>
Message-ID: <1F880D7A2494B346B5AB96481EAE704A08465F@EXMB-03.ad.wsu.edu>

Just a quick question: do you have permission to write to those directories? Otherwise, I would consider contacting MaterialStudio55 for support regarding the problem.
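
One way to answer the permissions question is to test each directory directly, ideally from inside a job script, because the job may run on a different node than the one you are logged in to. This is a minimal sketch, assuming a POSIX shell; `dir_status` is a made-up helper name for illustration and is not part of Torque or the application.

```shell
# Report whether a path is a directory the current user can write into.
# Prints one of: missing, not-a-directory, writable, not-writable.
# NOTE: dir_status is a hypothetical helper name used only for this example.
dir_status() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ ! -d "$1" ]; then
        echo "not-a-directory"
    elif [ -w "$1" ] && [ -x "$1" ]; then
        # Creating files needs both write and search (execute) permission.
        echo "writable"
    else
        echo "not-writable"
    fi
}
```

For example, `dir_status /public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp` could be added near the top of the job script. A result of `missing` would be consistent with the "No such file or directory" messages, while `not-writable` would point to a permissions problem.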
Thanks, Randall ________________________________________ From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Hongsheng Zhao [zhaohscas at yahoo.com.cn] Sent: Tuesday, December 06, 2011 2:51 AM To: Torque Users Mailing List Subject: [torqueusers] The issue when runing dmol3. Hi all, I run the dmol3 of MaterialsStudio55 standalonely by using pbs queueing system. In the pbs script for my job, I use the following lines: ------- # Set filenames for PBS to log standard error and standard output. #PBS -o stdout #PBS -e stderr --------- After the job has been finished, I find the following information in the stderr file in the current directory: ------------- zhaohongsheng at node32:~/work/Dr.Zhao/dmol3_test> cat stderr cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk'cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk': No such file or directory : No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such 
file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No 
such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_'cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory : No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat 
`/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_'cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory: No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_'cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_'cp: cannot stat 
`/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory : No such file or directory : No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283591120619/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d283602120619/Ge.tpdensk_': No such file or directory cp: cannot stat 
`/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77467120701/Ge.tpdensk_': No such file or directory
cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77423120701/Ge.tpdensk_': No such file or directory
cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d774910120701/Ge.tpdensk_': No such file or directory
cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77478120701/Ge.tpdensk_'cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77445120701/Ge.tpdensk_': No such file or directory
cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77434120701/Ge.tpdensk_': No such file or directory
: No such file or directory
cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77489120701/Ge.tpdensk_': No such file or directory
cp: cannot stat `/public/home/zhaohongsheng/work/Dr.Zhao/dmol3_test/tmp/d77456120701/Ge.tpdensk_': No such file or directory
zhaohongsheng at node32:~/work/Dr.Zhao/dmol3_test>
--------------

But it seems that the job has been performed successfully. I can't figure out why this happens. Any hints will be highly appreciated.

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From rsvancara at wsu.edu Tue Dec 6 04:40:11 2011
From: rsvancara at wsu.edu (Svancara, Randall)
Date: Tue, 6 Dec 2011 11:40:11 +0000
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4EDDD5F2.7010507@yahoo.com.cn>
References: <4ED86182.2000202@yahoo.com.cn> <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu>, <4EDDD5F2.7010507@yahoo.com.cn>
Message-ID: <1F880D7A2494B346B5AB96481EAE704A084678@EXMB-03.ad.wsu.edu>

Your /etc/hosts file could be very problematic with the way you have it configured. If I am on your cluster and I want to access node32.nxu.edu.cn, how can your systems resolve the correct address?

Why don't you use something like this to eliminate confusion:

192.168.1.32    node32.local node32
10.10.10.32     node32-ib.local node32-ib
202.201.128.36  headnode.nxu.edu.cn headnode

This way your systems have unique host names for all the interfaces on your head node. Make sure the /etc/hosts file is synchronized between all your nodes.

Thanks
Randall
________________________________________
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Hongsheng Zhao [zhaohscas at yahoo.com.cn]
Sent: Tuesday, December 06, 2011 12:44 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] Cann't generate the PBS log files.

On 12/03/2011 12:40 AM, Gustavo Correa wrote:
> Stdout and stderr by default stay in the first node on your list,
> until the job ends. At that point they're transferred to the work directory.
> Follow Christopher Samuel's recommendation for the pbs_mom.conf files [on all nodes!]
> I hope this helps,
> Gus Correa

Thanks a lot for all of the help here!

Finally, I've found that the issue was caused by incorrect entries for node32 in the /etc/hosts file on my cluster. For details, see the following.

Originally, the lines relevant to node32 in the /etc/hosts file were:

---------
192.168.1.32 node32.nxu.edu.cn
10.10.10.32 ibnode32
10.10.10.32 node32.nxu.edu.cn
202.201.128.36 node32.nxu.edu.cn node32
----------

The 10.10.10.x network in my case is an InfiniBand network, and 192.168.1.x is an Ethernet network.
In my case, each node has two sets of network settings. In addition, node32 is assigned as the management node, so it should have a public IP address, i.e., 202.201.128.36, for accessing the cluster from the internet.

Currently, I've changed the above lines into the following form:

--------
192.168.1.32 node32
10.10.10.32 ibnode32
202.201.128.36 node32.nxu.edu.cn node32
---------

It solved the issue I posted here.

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Tue Dec 6 05:51:41 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Tue, 06 Dec 2011 20:51:41 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A084678@EXMB-03.ad.wsu.edu>
References: <4ED86182.2000202@yahoo.com.cn> <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu>, <4EDDD5F2.7010507@yahoo.com.cn> <1F880D7A2494B346B5AB96481EAE704A084678@EXMB-03.ad.wsu.edu>
Message-ID: <4EDE0FDD.7060504@yahoo.com.cn>

On 12/06/2011 07:40 PM, Svancara, Randall wrote:
> Your /etc/hosts file could be very problematic with the way you have it configured.
> If I am on your cluster, and I want to access node32.nxu.edu.cn, how can your systems resolve the correct address?

In my case, both 202.201.128.36 and 192.168.1.32 are bound to different network interface cards on node32. Currently, I have only tried to use 202.201.128.36 to access the cluster remotely. Furthermore, since 202.201.128.36 and 192.168.1.32 actually point to the same node/host, i.e., node32, I assigned the hostname for node32 that way ;-)

Even so, is it still problematic? Could you please give me some more hints based on the above information?

> Why don't you use something like this to eliminate confusion.
>
> 192.168.1.32 node32.local node32
> 10.10.10.32 node32-ib.local node32-ib
> 202.201.128.36 headnode.nxu.edu.cn headnode

With this setup, can I access the head node from within Internet Explorer with something like the following?

http://headnode.nxu.edu.cn:web_port

>
> This way your systems have unique host names for all the interfaces on your head node. Make sure the /etc/hosts file is synchronized between all your nodes.

Thanks for your hints. I have a script that synchronizes the hosts files across the whole cluster.

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From j.kasiak at gmail.com Tue Dec 6 06:02:01 2011
From: j.kasiak at gmail.com (Jan Kasiak)
Date: Tue, 6 Dec 2011 08:02:01 -0500
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <4EDDCCE4.8080602@yahoo.com.cn>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn> <4EDDCCE4.8080602@yahoo.com.cn>
Message-ID:

Hi,

That's not actually true. I think half the cores are hyperthreaded. Look up your processor on Intel's website for the exact core count. cpuinfo reports logical, not physical, cores. It also looks like you have a dual-socket node.

-Jan
(from phone)

On Dec 6, 2011 3:06 AM, "Hongsheng Zhao" wrote:
> On 12/06/2011 01:38 PM, Hongsheng Zhao wrote:
> > To be frank, this file was given by the vendor. I want to know how to
> > determine the value of np accurately. Can I determine it based on the
> > output of /proc/cpuinfo for a specific node?
> > Say, on node33, I can
> > obtain the following information:
>
> It seems the following command will give me the value of np for a
> specific node, say, for node33:
>
> zhaohongsheng at node33:~> grep processor /proc/cpuinfo | wc -l
> 16
>
> Best regards
> --
> Hongsheng Zhao
> School of Physics and Electrical Information Science,
> Ningxia University, Yinchuan 750021, China

From gus at ldeo.columbia.edu Tue Dec 6 07:03:06 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 6 Dec 2011 09:03:06 -0500
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4EDDD5F2.7010507@yahoo.com.cn>
References: <4ED86182.2000202@yahoo.com.cn> <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu> <4EDDD5F2.7010507@yahoo.com.cn>
Message-ID: <71F9B327-ED76-467F-9F09-E76D6241C5D2@ldeo.columbia.edu>

Hi Hongsheng

Yes, /etc/hosts must be right and homogeneous on all nodes, if you are resolving the node names via /etc/hosts [which is a simple and good solution].

Your fix sounds right. The original file was mixing the external [Internet] and internal addresses and interfaces of node32. Most likely this was confusing Torque badly.

Your 10.10.10.0 subnet seems to be for IB (maybe IP over IB), and 192.168.1.0 for Ethernet, with different host names for each node interface, which is a typical setup.

Gus Correa

On Dec 6, 2011, at 3:44 AM, Hongsheng Zhao wrote:

> On 12/03/2011 12:40 AM, Gustavo Correa wrote:
>> Stdout and stderr by default stay in the first node on your list,
>> until the job ends. At that point they're transferred to the work directory.
>> Follow Christopher Samuel's recommendation for the pbs_mom.conf files [on all nodes!]
>> I hope this helps,
>> Gus Correa
>
> Thanks a lot for all of the help here!
>
> Finally, I've found that the issue was caused by incorrect entries for
> node32 in the /etc/hosts file on my cluster. For details, see the following.
>
> Originally, the lines relevant to node32 in the /etc/hosts file were:
>
> ---------
> 192.168.1.32 node32.nxu.edu.cn
> 10.10.10.32 ibnode32
> 10.10.10.32 node32.nxu.edu.cn
> 202.201.128.36 node32.nxu.edu.cn node32
> ----------
>
> The 10.10.10.x network in my case is an InfiniBand network, and
> 192.168.1.x is an Ethernet network. In my case, each node has two sets of
> network settings. In addition, node32 is assigned as the management
> node, so it should have a public IP address, i.e., 202.201.128.36, for
> accessing the cluster from the internet.
>
> Currently, I've changed the above lines into the following form:
>
> --------
> 192.168.1.32 node32
> 10.10.10.32 ibnode32
> 202.201.128.36 node32.nxu.edu.cn node32
> ---------
>
> It solved the issue I posted here.
>
> Best regards
> --
> Hongsheng Zhao
> School of Physics and Electrical Information Science,
> Ningxia University, Yinchuan 750021, China

From gus at ldeo.columbia.edu Tue Dec 6 07:10:56 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 6 Dec 2011 09:10:56 -0500
Subject: [torqueusers] The issue when runing dmol3.
In-Reply-To: <4EDDF3C8.4030103@yahoo.com.cn>
References: <4EDDF3C8.4030103@yahoo.com.cn>
Message-ID:

Hi Hongsheng

It could be many things. My guess is that /public/home is a directory shared via NFS, maybe physically located on node32, and mounted on all nodes. Maybe a few nodes are not mounting it correctly. You could log in to each node and make sure it is being mounted.
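
A minimal sketch of that per-node check, assuming passwordless ssh from the head node to the compute nodes. The function name check_shared_dir and the RSH override are made up for illustration and are not part of Torque. It tests the directory itself rather than reading the mount table, so an automounted directory gets a chance to mount on access:

```shell
#!/bin/sh
# check_shared_dir DIR NODE...
# For each node, test whether DIR is visible there.
# RSH is the remote-shell command to use; ssh is assumed by default.
check_shared_dir() {
    dir=$1
    shift
    for node in "$@"; do
        # </dev/null keeps the remote shell from consuming our stdin
        if ${RSH:-ssh} "$node" "test -d '$dir'" </dev/null; then
            echo "$node: ok"
        else
            echo "$node: MISSING $dir"
        fi
    done
}
```

For example, `check_shared_dir /public/home node32 node33` would print one line per node. Note that a hung NFS mount can make the test block instead of fail, so a node that never answers is also worth a closer look.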
The /etc/auto.* files are the first thing to look at.

I hope this helps.
Gus Correa

PS - To test your cluster functionality, with Torque and MPI, I would suggest that you first run something very simple. This is a good thing to do before you try more complex applications. MPICH2 comes with a program called cpi.c, and OpenMPI comes with three programs: connectivity_c.c, ring_c.c and hello_c.c. I like connectivity_c.c better, because it tests connections across all pairs of nodes. You can download MPICH2 and OpenMPI from the Internet.

From gus at ldeo.columbia.edu Tue Dec 6 07:15:19 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 6 Dec 2011 09:15:19 -0500
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To:
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn> <4EDDCCE4.8080602@yahoo.com.cn>
Message-ID: <5264369E-B90E-4269-80ED-597A93AA6474@ldeo.columbia.edu>

Jan is probably right. You can turn off hyperthreading in the BIOS, if you want. A number of people do this for parallel applications.

Gus Correa

On Dec 6, 2011, at 8:02 AM, Jan Kasiak wrote:

> Hi,
>
> That's not actually true. I think half the cores are hyperthreaded. Look up your processor on Intel's website for the exact core count. cpuinfo reports logical, not physical, cores. It also looks like you have a dual-socket node.
>
> -Jan
> (from phone)
>
> On Dec 6, 2011 3:06 AM, "Hongsheng Zhao" wrote:
> On 12/06/2011 01:38 PM, Hongsheng Zhao wrote:
> > To be frank, this file was given by the vendor. I want to know how to
> > determine the value of np accurately. Can I determine it based on the
> > output of /proc/cpuinfo for a specific node?
> > Say, on node33, I can
> > obtain the following information:
>
> It seems the following command will give me the value of np for a
> specific node, say, for node33:
>
> zhaohongsheng at node33:~> grep processor /proc/cpuinfo | wc -l
> 16
>
> Best regards
> --
> Hongsheng Zhao
> School of Physics and Electrical Information Science,
> Ningxia University, Yinchuan 750021, China

From biswas.koushik at gmail.com Tue Dec 6 07:33:39 2011
From: biswas.koushik at gmail.com (Koushik Biswas)
Date: Tue, 6 Dec 2011 09:33:39 -0500
Subject: [torqueusers] Jobs crash if more nodes selected
Message-ID:

OK, since I see everybody is so helpful here, I thought I could get away with getting some good hints/answers to my problems without having to do my due diligence!

I recently installed torque2.5.5 and pbs_sched is my scheduler. I have a 68-node cluster and the following questions/issues:

(1) I have Intel i7 processors, which have 4 cores plus hyperthreading. In the "nodes" file I have np=4. Should I use np=8? Could it have a performance benefit?

(2) I run a code called VASP. When I select 2 nodes for a job, the job seems to run OK. If I select more, say 4 nodes, the job crashes and some of the nodes even go down. I have not really done much research on this. I understand it could be OpenMPI or VASP compilation issues. If anyone has seen similar behavior or has anything to say about this, please let me know.

Thanks,
Koushik
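
On the np questions in this thread: a rough way to tell how many of the processors listed in /proc/cpuinfo are physical cores rather than hyperthreads is to count distinct (physical id, core id) pairs. A minimal sketch, assuming an x86 Linux node whose /proc/cpuinfo carries both fields (some virtual machines and other architectures omit them, in which case the physical count comes out as 0). The function name count_cores is made up for illustration:

```shell
#!/bin/sh
# count_cores [FILE]
# Print the number of logical processors and of distinct physical cores
# found in a cpuinfo-style file (defaults to /proc/cpuinfo).
count_cores() {
    awk -F': *' '
        /^processor/   { logical++ }             # one entry per logical CPU
        /^physical id/ { phys = $2 }             # socket of the current entry
        /^core id/     { cores[phys "," $2] = 1 } # unique socket,core pair
        END {
            n = 0
            for (k in cores) n++
            printf "logical=%d physical=%d\n", logical, n
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Running `count_cores` on a compute node prints both counts on one line. If the logical count is twice the physical count, hyperthreading is most likely enabled, and whether np should follow the logical or the physical number is the kind of thing best settled by timing a real job both ways.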
From jwbacon at tds.net Tue Dec 6 07:43:43 2011
From: jwbacon at tds.net (Jason Bacon)
Date: Tue, 06 Dec 2011 08:43:43 -0600
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <71F9B327-ED76-467F-9F09-E76D6241C5D2@ldeo.columbia.edu>
References: <4ED86182.2000202@yahoo.com.cn> <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu> <4EDDD5F2.7010507@yahoo.com.cn> <71F9B327-ED76-467F-9F09-E76D6241C5D2@ldeo.columbia.edu>
Message-ID: <4EDE2A1F.9010902@tds.net>

On 12/06/11 08:03, Gustavo Correa wrote:
> Hi Hongsheng
>
> Yes, /etc/hosts must be right and homogeneous on all nodes,
> if you are resolving the node names via /etc/hosts [which is a simple and good solution].

Good advice, but just to prevent someone from taking this too literally, there is one exception: if your head node is acting as a gateway, you have to be sure that the hostname used by Torque is bound to the primary interface, or pbs_server will get confused.

The admin manual mentions the SERVERHOST parameter for multihomed hosts, but I had no luck with it even after significant effort and experimenting. Ultimately, I bound the hostname of my head node to the external address on the head node and to the internal address on all the compute nodes, and everything runs smoothly now.

Head node/gateway:

129.89.25.224   peregrine.hpc.uwm.edu peregrine

# Local names and addresses
192.168.0.3     compute-01.local compute-01
192.168.0.4     compute-02.local compute-02

Compute nodes:

192.168.0.2     peregrine.hpc.uwm.edu peregrine

# Local names and addresses
192.168.0.3     compute-01.local compute-01
192.168.0.4     compute-02.local compute-02

Cheers,

-J

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From tiago.silva at cefas.co.uk Tue Dec 6 07:29:44 2011
From: tiago.silva at cefas.co.uk (Tiago Silva (Cefas))
Date: Tue, 6 Dec 2011 14:29:44 -0000
Subject: [torqueusers] hydra *and* mpd
In-Reply-To: <04A370231C10664C88B28D1EF74F487903360BC8@LOWEXPRESS.corp.cefas.co.uk>
References: <04A370231C10664C88B28D1EF74F487903360BC8@LOWEXPRESS.corp.cefas.co.uk>
Message-ID: <04A370231C10664C88B28D1EF74F487903360BCF@LOWEXPRESS.corp.cefas.co.uk>

Hi

We have a 20-node cluster with Rocks 5.3 and mpich2 1.3.1. Most users use mpiexec with hydra, but one of our models requires a second version of mpich that was compiled differently, and we submit these jobs using mpirun and an mpd ring.

Due to lack of foresight, torque wasn't installed when the cluster was built. We are planning to install torque on top of the existing installation. Would torque be able to handle two different initialization methods (mpirun and mpiexec)?

Thanks,
Tiago
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111206/d2e0e1d0/attachment.html

From rsvancara at wsu.edu Tue Dec 6 07:57:03 2011
From: rsvancara at wsu.edu (Svancara, Randall)
Date: Tue, 6 Dec 2011 14:57:03 +0000
Subject: [torqueusers] Jobs crash if more nodes selected
In-Reply-To:
References:
Message-ID: <1F880D7A2494B346B5AB96481EAE704A084711@EXMB-03.ad.wsu.edu>

One way to find out is to test and see. I would run one job with hyperthreading turned on and one job with it off.

Do you have any error messages in your stderr file?

Thanks,

Randall

________________________________
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Koushik Biswas [biswas.koushik at gmail.com]
Sent: Tuesday, December 06, 2011 6:33 AM
To: Torque Users Mailing List
Subject: [torqueusers] Jobs crash if more nodes selected

OK, since I see everybody is so helpful here I thought I could get away with getting some good hints/answers to my problems without having to do my due diligence! I recently installed torque2.5.5 and pbs_sched is my scheduler. I have a 68 node cluster and the following questions/issues:

(1) I have Intel i7 processors, which have 4 cores plus hyperthreading. In the "nodes" file I have np=4. Should I use np=8? Could it have a performance benefit?

(2) I run a code called VASP. When I select 2 nodes for a job, the job seems to run OK. If I select more, say 4 nodes, the job crashes and some of the nodes even go down. I have not really done much research on this. I understand it could be openmpi or VASP compiling issues. If anyone has seen similar behavior or has anything to say about this, please let me know.

Thanks,
Koushik

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111206/b864e412/attachment.html

From shaomin.hu at gmail.com Tue Dec 6 08:16:41 2011
From: shaomin.hu at gmail.com (Shaomin Hu)
Date: Tue, 6 Dec 2011 10:16:41 -0500
Subject: [torqueusers] syntax for requesting 10 nodes with 1 core on each node exclusively
Message-ID:

We are using Torque v3.0.2 and want to run a job exclusively on 10 different nodes, 1 core on each node. There are 16 cores on each node. If we use the syntax "-lnodes=10:ppn=1 -n", the job still only gets one node exclusively to run, not 10 different nodes.

What syntax should we use to get 10 different nodes, 1 core on each node exclusively?

Thanks much.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111206/d3ad836f/attachment.html

From jwbacon at tds.net Tue Dec 6 08:25:14 2011
From: jwbacon at tds.net (Jason Bacon)
Date: Tue, 06 Dec 2011 09:25:14 -0600
Subject: [torqueusers] Jobs crash if more nodes selected
In-Reply-To:
References:
Message-ID: <4EDE33DA.1080909@tds.net>

If nodes are going down, I would first check the memory use of these jobs. There isn't much else that will bring down a Unix system, and this is fairly common on Linux systems which are pretty liberal about overcommitting memory by default. ( Typically by 30% of RAM capacity, so if you have 24G RAM and 4G swap, the system could allow up to 28G + 7.2G of total allocations to succeed. )

You can watch memory use with "top" on one of the compute nodes it gets dispatched to.

To avoid this problem, I set low default (soft) limits on pvmem in my Torque config to force users to specify their memory requirements in jobs that require significant memory. If nobody specifies pvmem, they'll all be limited to less than the available RAM/core by the scheduler.
If anyone needs more, then the scheduler will know about it, and won't dispatch their processes to a node that doesn't have enough memory.

There's still a chance of overallocating, since the scheduler might only do periodic checks on actual use before killing the job, depending on Torque config and OS. If memory use grows quickly, the system could be overcommitted before the scheduler's next check.

To prevent any chance of overallocating memory, you have to configure the kernel to use stricter policies.

See the following for a thorough explanation:

http://www.win.tue.nl/~aeb/linux/lk/lk-9.html

Found by searching "linux overcommit memory". Not sure if you're using Linux, so the policies and methods could be quite different for you. Our Torque cluster runs FreeBSD, which is a bit more conservative out-of-the-box, but we've seen some nodes taken down by memory overcommit on our Redhat + LSF cluster.

Regards,

-J

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From jdsmit at sandia.gov Tue Dec 6 09:03:31 2011
From: jdsmit at sandia.gov (Smith, Jerry Don II)
Date: Tue, 6 Dec 2011 16:03:31 +0000
Subject: [torqueusers] [EXTERNAL] syntax for requesting 10 nodes with 1 core on each node exclusively
In-Reply-To:
Message-ID:

What scheduler are you using? It will depend on that to get the results you are looking for.

Jerry

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111206/3be3d4fc/attachment-0001.html

From akohlmey at cmm.chem.upenn.edu Tue Dec 6 09:49:28 2011
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Tue, 6 Dec 2011 11:49:28 -0500
Subject: [torqueusers] syntax for requesting 10 nodes with 1 core on each node exclusively
In-Reply-To:
References:
Message-ID:

On Tue, Dec 6, 2011 at 10:16 AM, Shaomin Hu wrote:

> We are using Torque v3.0.2 and want to run a job exclusively on 10
> different nodes, 1 core on each node. There are 16 cores on each node.
> If we use the syntax "-lnodes=10:ppn=1 -n", the job still only get one node
> exclusively to run, not 10 different nodes.
>
> What syntax should we use to get 10 different nodes, 1 core on each node
> exclusively?
>

-l nodes=10:ppn=16

if you want to have the node exclusively, you need to request all cores on the node (unless there are some measures in place, that enforce a "one-job-per-node policy"). for the parallel job you just need to set it up so that it uses only one core per node, e.g. using -npernode 1 with OpenMPI.

cheers,
axel.

>
> Thanks much.
>

--
Dr. Axel Kohlmeyer
akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111206/7eb4a36a/attachment.html

From gus at ldeo.columbia.edu Tue Dec 6 10:02:13 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 6 Dec 2011 12:02:13 -0500
Subject: [torqueusers] Jobs crash if more nodes selected
In-Reply-To: <4EDE33DA.1080909@tds.net>
References: <4EDE33DA.1080909@tds.net>
Message-ID:

You can login to the nodes and use top to see how many processes and/or threads are executing, how much memory is being used, etc.

Besides Jason's suggestions, check also the limits on the nodes, especially stacksize, locked memory, and maybe the max open files.
We get by with this in /etc/security/limits.conf:

* - memlock -1
* - stack -1
* - nofile 4096

I know nothing about VASP, so these are guesses:
Does VASP launch threads [say via OpenMP] from each MPI process?
If yes, you may need to account for them, and give a physical core, or perhaps a 'hyperthreaded' core to each thread.
This may require reducing the np=4 in your mpiexec command line.
If the processes or threads are paging to disk, then you need to reduce np for sure.
If worse comes to worst, turn off hyperthreading in the BIOS as well.

I hope this helps,
Gus Correa

From gus at ldeo.columbia.edu Tue Dec 6 10:12:39 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 6 Dec 2011 12:12:39 -0500
Subject: [torqueusers] syntax for requesting 10 nodes with 1 core on each node exclusively
In-Reply-To:
References:
Message-ID: <596E761A-51D8-486B-B226-3BD5597262A7@ldeo.columbia.edu>

If you use the Maui scheduler, one solution is to set:

JOBNODEMATCHPOLICY EXACTNODE

in $MAUI/maui.cfg and restart maui.

Regardless of which scheduler you use, an alternative is to request all cores on each node, but use fewer than that. OpenMPI can do that with the -bynode flag. Something like this:

#PBS -l nodes=10:ppn=16
...
mpiexec -bynode -np 10 ./my_mpi_program

There are other alternatives. For instance, the OSC mpiexec has the flag -pernode, which does pretty much the same thing, and can be used to launch MPI programs compiled with MPICH2.

OpenMPI: http://www.open-mpi.org/
OSC mpiexec: http://www.osc.edu/~djohnson/mpiexec/index.php
MPICH2: http://www.mcs.anl.gov/research/projects/mpich2/

I hope this helps,
Gus Correa

On Dec 6, 2011, at 10:16 AM, Shaomin Hu wrote:

> We are using Torque v3.0.2 and want to run a job exclusively on 10 different nodes, 1 core on each node. There are 16 cores on each node. If we use the syntax "-lnodes=10:ppn=1 -n", the job still only get one node exclusively to run, not 10 different nodes.
>
> What syntax should we use to get 10 different nodes, 1 core on each node exclusively?
>
> Thanks much.
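
The difference between by-node and by-slot placement discussed in this thread can be sketched in a few lines of Python. This is only an illustration of the idea, not how Open MPI or the OSC mpiexec implement it; the host names and the function names (unique_hosts, place_by_node, place_by_slot) are made up for the example. It assumes a node file in the usual Torque form, with each host listed once per allocated core.

```python
from collections import OrderedDict


def unique_hosts(nodefile_lines):
    """Return hosts in first-seen order; the node file repeats a host once per core."""
    stripped = (line.strip() for line in nodefile_lines)
    return list(OrderedDict.fromkeys(h for h in stripped if h))


def place_by_node(nodefile_lines, nprocs):
    """Round-robin ranks over distinct hosts: one rank per host before reusing any."""
    hosts = unique_hosts(nodefile_lines)
    return [hosts[rank % len(hosts)] for rank in range(nprocs)]


def place_by_slot(nodefile_lines, nprocs):
    """Fill slots in file order, so the first host's cores are used up first."""
    slots = [line.strip() for line in nodefile_lines if line.strip()]
    return [slots[rank % len(slots)] for rank in range(nprocs)]


# Hypothetical allocation: 10 nodes x 16 cores, as requested by nodes=10:ppn=16.
nodefile = ["node%02d" % n for n in range(1, 11) for _ in range(16)]

by_node = place_by_node(nodefile, 10)
by_slot = place_by_slot(nodefile, 10)

print("by node:", len(set(by_node)), "distinct hosts")
print("by slot:", len(set(by_slot)), "distinct hosts")
```

With the whole-node request, by-node placement puts the 10 processes on 10 different hosts, while by-slot placement would pack all of them onto the first host.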
From jjc at iastate.edu Tue Dec 6 10:28:57 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Tue, 6 Dec 2011 17:28:57 +0000
Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes : Suggested error message change.
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102C6360AC8@exvic-mbx04.nexus.csiro.au>
References: <4EDAA6B7.1040106@gmail.com> <242421BFAF465844BE24EB90BB97E221017DDE2F@ITSDAG3D.its.iastate.edu> <007DECE986B47F4EABF823C1FBB19C620102C6360AC8@exvic-mbx04.nexus.csiro.au>
Message-ID: <242421BFAF465844BE24EB90BB97E221017DDFA5@ITSDAG3D.its.iastate.edu>

Gareth,

Since you asked about whether the message is good, I'd recommend a change in the message. I've always thought that the np= syntax in the node file and from pbsnodes is inconsistent with the ppn= syntax in the qsub request.

Note the checks in the function

  static int proplist(

(in the 2.5.4 version which I am running, in the source file server/node_manager.c). When ":" is found in the nodes= portion of the job requirements and "=" is found in the string following it, the function checks whether this is one of the "special properties" ppn, procs or gpus.

1) If "ppn", "procs" or "gpus" is found, node_req is checked for a number (a positive integer). If one is not found, "1" is returned from proplist().

2) If none of the above special properties is found, again a "1" is returned from proplist().

I suggest that the return from proplist should be different in these two cases, e.g. 255 when "xxx" is not recognized in "xxx=yyy", and perhaps 1, 2 and 3 for the error returns from the ppn=, procs= and gpus= number checks.

Then a more meaningful test could be performed on return from proplist. (I'd also store the offending string to show the user what was unacceptable.)

E.g. for error return 255,

  Job requirement specification: nodes=2:xxx=27 is not a valid request.
  "xxx" is not an acceptable special property, only ppn= , procs= and gpus= are acceptable here.

For error return 3,

  Job requirement specification: nodes=2:gpus=yyy is not a valid request.
  String "yyy" after gpus= must be a positive integer.

In general, I believe that error messages should let the user know
1) what is wrong in what they wrote, and
2) (if possible) how to change what they wrote into something acceptable.

More specifically, change the code segment

      if (strcmp(pname, "ppn") == 0)
        {
        pequal++;

        if ((number(&pequal, node_req) != 0) || (*pequal != '\0'))
          {
          return(1);
          }
        }
      else if (strcmp(pname, "procs") == 0)
        {
        pequal++;

        if ((number(&pequal, node_req) != 0) || (*pequal != '\0'))
          {
          return(1);
          }
        }
      else if (strcmp(pname, "gpus") == 0)
        {
        pequal++;

        if ((number(&pequal, gpu_req) != 0) || (*pequal != '\0'))
          {
          return(1);
          }
        }
      else
        {
        return(1); /* not recognized - error */
        }

in server/node_manager.c to:

      if (strcmp(pname, "ppn") == 0)
        {
        pequal++;

        if ((number(&pequal, node_req) != 0) || (*pequal != '\0'))
          {
          return(1); /* ppn= number not recognized - error */
          }
        }
      else if (strcmp(pname, "procs") == 0)
        {
        pequal++;

        if ((number(&pequal, node_req) != 0) || (*pequal != '\0'))
          {
          return(2); /* procs= number not recognized - error */
          }
        }
      else if (strcmp(pname, "gpus") == 0)
        {
        pequal++;

        if ((number(&pequal, gpu_req) != 0) || (*pequal != '\0'))
          {
          return(3); /* gpus= number not recognized - error */
          }
        }
      else
        {
        return(255); /* xxx= appears but xxx is not one of ppn, procs, or gpus - error */
        }

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gareth.Williams at csiro.au
>Sent: Tuesday, December 06, 2011 12:11 AM
>To: torqueusers at supercluster.org
>Subject: Re: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
>
>> -----Original Message-----
>> From: Coyle, James J [ITACD] [mailto:jjc at
iastate.edu]
>> Sent: Tuesday, 6 December 2011 7:03 AM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
>>
>> I think that you have a typo.
>>
>> Try using ppn=2 rather than npp=2
>
>You are also getting ncpus and procct set from the default_max
>numbers. This might be OK but might be problematic. I'd avoid
>ncpus but procct is probably OK as I think it gets stripped from the
>job as it is started anyway.
>
>All: is this a reasonable MSG? Would it be hard to make the feedback
>more direct in this case? Is npp=2 in this context clearly an error
>or could it be meaningful in a 'real' cluster configuration?
>
>Gareth
>
>> >-----Original Message-----
>> >From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Clarke Earley
>> >Sent: Saturday, December 03, 2011 4:46 PM
>> >To: torqueusers at supercluster.org
>> >Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
>> >
>> >I am in the process of setting up Torque and Maui on a 2 node cluster
>> >running under Debian. Compilation and installation ran without any problems
>> >and submission of simple test jobs ( $ echo "sleep 30" | qsub ) also ran
>> >without any issue. However, when I try to specify multiple nodes, the
>> >jobs fail as follows.
>> >
>> > > $ echo "sleep 30" | qsub -l nodes=2:npp=2 -q batch
>> > > qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodes requirement
>> > > $ echo "sleep 30" | qsub -l nodes=1:npp=2 -q batch
>> > > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy)
>> >
>> >The file server_priv/nodes exists on the master node (thebrain):
>> > > $ cat /var/spool/torque/server_priv/nodes
>> > > thebrain np=12
>> > > yakko np=12
>> >
>> >and appears to be recognized by pbsnodes
>> > > $ pbsnodes
>> > > thebrain
>> > >      state = free
>> > >      np = 12
>> > >      ntype = cluster
>> > >      status = rectime=1322943942,varattr=,jobs=,state=free,netload=905815039,gres=,loadave=0.00,ncpus=24,physmem=33004284kb,availmem=97029336kb,totmem=97457404kb,idletime=1692,nusers=2,nsessions=11,sessions=19903 24143 24149 24150 24151 24152 24896 24902 24903 24904 24905,uname=Linux chem 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 x86_64,opsys=linux
>> > >      gpus = 0
>> >
>> > > yakko
>> > >      state = free
>> > >      np = 12
>> > >      ntype = cluster
>> > >      status = rectime=1322943940,varattr=,jobs=,state=free,netload=242901050,gres=,loadave=0.00,ncpus=24,physmem=33004284kb,availmem=56432072kb,totmem=56736504kb,idletime=3087,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux yakko 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 x86_64,opsys=linux
>> > >      gpus = 0
>> >
>> >The output of qstat appears to indicate that resources are available:
>> > > $ qstat -Qf
>> > > Queue: batch
>> > >      queue_type = Execution
>> > >      total_jobs = 0
>> > >      state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
>> > >      resources_max.ncpus = 4
>> > >      resources_max.nodes = 2
>> > >      resources_max.procct = 24
>> > >      resources_default.nodes = 1
>> > >      resources_default.walltime = 01:00:00
>> > >      mtime = 1322940014
>> > >      resources_available.ncpus = 4
>> > >      resources_available.nodes = 2
>> > >      resources_available.procct = 24
>> > >      resources_assigned.nodect = 0
>> > >      enabled = True
>> > >      started = True
>> >
>> >I did not see anything in the log files that appeared helpful. Any
>> >suggestions would be most appreciated. Thank you in advance for
>> >your help.

From ChemProf89 at gmail.com Tue Dec 6 11:02:49 2011
From: ChemProf89 at gmail.com (Clarke Earley)
Date: Tue, 06 Dec 2011 13:02:49 -0500
Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102C6360AC8@exvic-mbx04.nexus.csiro.au>
References: <4EDAA6B7.1040106@gmail.com> <242421BFAF465844BE24EB90BB97E221017DDE2F@ITSDAG3D.its.iastate.edu> <007DECE986B47F4EABF823C1FBB19C620102C6360AC8@exvic-mbx04.nexus.csiro.au>
Message-ID: <4EDE58C9.30809@gmail.com>

Thank you. I was using the wrong option for qsub.
However, it turns out that I had also not set the correct configure options to compile this program, so even when using the correct syntax (-l nodes=2:ppn=2) things did not work. The solution that appeared to work for my system (Debian Linux) was to set the following configuration options:

> ./configure --enable-unixsockets --enable-cpuset --enable-geometry-requests

I have not looked at the code, so I don't know how difficult it would be to change the error message. I ran the simple test:

> $ echo "sleep 30" | qsub -l nodes=2:garbagein=garbageout

Obviously, the garbagein parameter is not valid. However, the error message is not terribly helpful for new users.

> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy)

One possible solution might be to append a bit of text to this message to read "cannot locate feasible nodes (nodes file is empty or all systems are busy) OR NODES PARAMETER NOT RECOGNIZED". This is a tricky problem, since the documentation indicates that arbitrary parameters can be appended to the nodes option.

Please realize that none of this is meant to be a complaint, and I am very appreciative of the developers for making this program available. Again, thank you for your help with my issues.

From biswas.koushik at gmail.com Tue Dec 6 11:16:43 2011
From: biswas.koushik at gmail.com (Koushik Biswas)
Date: Tue, 6 Dec 2011 13:16:43 -0500
Subject: [torqueusers] Jobs crash if more nodes selected
In-Reply-To:
References: <4EDE33DA.1080909@tds.net>
Message-ID:

Thanks Svancara, Jason, Gustavo for the valuable opinions about some of my jobs crashing if more nodes are selected. I'll definitely look into these. I am using SUSE linux. In my bashrc, I have set OMP_NUM_THREADS=1 and also stack size unlimited. But I guess torque may still inherit the system default stacksize!

-Koushik

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111206/dbe5dad2/attachment.html

From roman.ricardo at gmail.com Tue Dec 6 15:41:07 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Tue, 6 Dec 2011 16:41:07 -0600
Subject: [torqueusers] specific nodes
In-Reply-To: <4EDC4920.2080006@unimelb.edu.au>
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> <1BAB6393-963A-4A58-8223-318FAFC5AF99@ldeo.columbia.edu> <4EDC4920.2080006@unimelb.edu.au>
Message-ID:

OK friends!!

I think *We* got this running!

Torque 2.5.1
Maui 3.3
MPICH2 1.4.1p1

with those, there's no need to build OSC mpiexec...

=)

but I'm having another problem here!!! but I think I will post in another thread for order's sake hehe...

cya REALLY SOON! :D

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111206/233732d6/attachment.html

From roman.ricardo at gmail.com Tue Dec 6 15:46:59 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Tue, 6 Dec 2011 16:46:59 -0600
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
Message-ID:

Hi guys!

I got this torque server running 2.5.1:

> [root at zarate-0 ld.so.conf.d]# qmgr -c "p s"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue uno
> #
> create queue uno
> set queue uno queue_type = Execution
> set queue uno resources_default.neednodes = uno
> set queue uno resources_default.nodes = 1
> set queue uno resources_default.walltime = 01:00:00
> set queue uno enabled = True
> set queue uno started = True
> #
> # Create and define queue dos
> #
> create queue dos
> set queue dos queue_type = Execution
> set queue dos resources_default.neednodes = dos
> set queue dos resources_default.nodes = 1
> set queue dos resources_default.walltime = 01:00:00
> set queue dos enabled = True
> set queue dos started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = zarate-0
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 99

and Maui's CFG file is:

> [root at zarate-0 maui]# cat maui.cfg
> # maui.cfg 3.3
> SERVERHOST zarate-0
> # primary admin must be first in list
> ADMIN1 root
> # Resource Manager Definition
> RMCFG[zarate-0] TYPE=PBS
> # Allocation Manager Definition
> AMCFG[bank] TYPE=NONE
> # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
> # use the 'schedctl -l' command to display current configuration
> RMPOLLINTERVAL 00:00:30
> SERVERPORT 42559
> SERVERMODE NORMAL
> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
> LOGFILE maui.log
> LOGFILEMAXSIZE 10000000
> LOGLEVEL 3
> # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
> QUEUETIMEWEIGHT 1
> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
> #FSPOLICY PSDEDICATED
> #FSDEPTH 7
> #FSINTERVAL 86400
> #FSDECAY 0.80
> # Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html
> # NONE SPECIFIED
> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
> # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
> NODEALLOCATIONPOLICY MINRESOURCE
> # QOS: http://supercluster.org/mauidocs/7.3qos.html
> # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
> # Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html
> # SRSTARTTIME[test] 8:00:00
> # SRENDTIME[test] 17:00:00
> # SRDAYS[test] MON TUE WED THU FRI
> # SRTASKCOUNT[test] 20
> # SRMAXTIME[test] 0:30:00
> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
> # USERCFG[DEFAULT] FSTARGET=25.0
> # USERCFG[john] PRIORITY=100 FSTARGET=10.0-
> # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
> # CLASSCFG[batch] FLAGS=PREEMPTEE
> # CLASSCFG[interactive] FLAGS=PREEMPTOR
> ##################################################
> ENABLEMULTIREQJOBS TRUE

Now, when i issue a simple test job to queue "uno" it finishes fine but when i send the SAME job to queue "dos" it just wont run. Here's the job script:

> #PBS -q QUEUE
> #PBS -l nodes=4
> echo "Nodes Assigned:"
> cat $PBS_NODEFILE
> echo "running... -l nodes=4 && -n 1"
> /usr/local/bin/mpiexec -n 1 $HOME/a.out
> echo
> echo "running... -l nodes=4 && -n 2"
> /usr/local/bin/mpiexec -n 2 $HOME/a.out
> echo
> echo "running... -l nodes=4 && -n 4"
> /usr/local/bin/mpiexec -n 4 $HOME/a.out
> echo
> echo "running... -l nodes=4 && -n 8"
> /usr/local/bin/mpiexec -n 8 $HOME/a.out
> echo
>
> echo "done"

The difference between the jobs is the queue where they run, the rest is the same.

Any ideas?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111206/bd6d71a2/attachment-0001.html

From gus at ldeo.columbia.edu Tue Dec 6 15:56:24 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 6 Dec 2011 17:56:24 -0500
Subject: [torqueusers] specific nodes
In-Reply-To:
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> <1BAB6393-963A-4A58-8223-318FAFC5AF99@ldeo.columbia.edu> <4EDC4920.2080006@unimelb.edu.au>
Message-ID:

On Dec 6, 2011, at 5:41 PM, Ricardo Román Brenes wrote:

> OK friends!!
>
> I think We got this running!
>
> Torque 2.5.1
> Maui 3.3
> MPICH2 1.4.1p1
>
> with those, there's no need to build OSC mpiexec...
>
> =)

Congrats!

> but im having another problem here!!! but i think i will post in another thread for order's sake hehe...

Good idea.

> cya REALLY SOON! :D

Answers soon ... :)

From gus at ldeo.columbia.edu Tue Dec 6 16:07:27 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 6 Dec 2011 18:07:27 -0500
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References:
Message-ID: <68739902-BCC8-4FD0-B36C-BB7651F54367@ldeo.columbia.edu>

Hi Ricardo

What's the error message you get when you submit the job to queue 'dos'?
Does it say something like there are not enough resources?

I presume that in your script you switch the word 'QUEUE' from 'uno' to 'dos', right?
You must say which specific queue you want, otherwise it will use the default.
But since you don't set a default queue on the server [you could and should], the default may be something that Torque determines by itself.
[To set 'uno' as the default queue, do: qmgr -c 'set server default_queue=uno' ]

Do you have 4 nodes of type 'dos'?
[You are requesting four, so you must have at least four of type 'dos'. Do you?]

I confess I don't remember anymore what is in your $TORQUE/server_priv/nodes
Would you send it again? It may help.
Likewise for the output of 'pbsnodes' on the head node / Torque server.

I hope this helps,
Gus Correa

On Dec 6, 2011, at 5:46 PM, Ricardo Román Brenes wrote:

> Hi guys!
>
> i got this torque server runing 2.5.1:
>
> [...]
>
> Now, when i issue a simple test job to queue "uno" it finishes fine but when i send the SAME job to queue "dos" it just wont run. Here's the job script:
>
> [...]
>
> The difference between the jobs is the queue where they run, the rest is the same.
>
> Any ideas?

From scrusan at ur.rochester.edu Tue Dec 6 17:41:25 2011
From: scrusan at ur.rochester.edu (Steve Crusan)
Date: Tue, 6 Dec 2011 19:41:25 -0500
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References:
Message-ID: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Dec 6, 2011, at 5:46 PM, Ricardo Román Brenes wrote:

> Hi guys!
>
> i got this torque server runing 2.5.1:
>
> [...]
>
> Now, when i issue a simple test job to queue "uno" it finishes fine but
> when i send the SAME job to queue "dos" it just wont run. Here's the job
> script:
>
> [...]
>
> The difference between the jobs is the queue where they run, the rest is

Make sure your nodes have the right FEATURES. That is, queue dos requires that the node that will run its job has the 'dos' feature, which should be specified in your $TORQUE_HOME/server_priv/nodes file.

If your job for the queue dos is waiting in the queued state, run a checkjob $JOBID (and checkjob -v).
That is a Maui command, but usually it will give you some reasons as to why the job isn't starting.

> the same.
>
> Any ideas?

----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJO3rY7AAoJENS19LGOpgqKEB8H/1j+624G/ZbCRX/sadp5HEVa
c+1PPj5hyqroExdnN3fy0OcVg6I3gN6833oVIguGWgwqc+Yxk3VZZrfSLqV0NjQI
4upau/Z1MmV75kjunI+w94CT1sr0aOSpmtu7nTE0x5BSQy6Mgjd36UJDzrjP2Sup
d0x8xApY0pPalYeRHR3ip90fU3i5asfWhnaJM//iisBbkawo4A2d++IIiUjx23h3
jAxrIQuCs6oOImnZcnjccjS/0nyqTcguctXZOh7Feixo86nPfKpQH9mAhAd4PkqD
sMU/Tl4Z63XEgnYLgOcNzdc9Bqct9TEtIbFQQB3tPwaKmjrK04xPoycyZnL1JAE=
=Oa/K
-----END PGP SIGNATURE-----

From zhaohscas at yahoo.com.cn Tue Dec 6 19:27:40 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Wed, 07 Dec 2011 10:27:40 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4EDE2A1F.9010902@tds.net>
References: <4ED86182.2000202@yahoo.com.cn> <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu> <4EDDD5F2.7010507@yahoo.com.cn> <71F9B327-ED76-467F-9F09-E76D6241C5D2@ldeo.columbia.edu> <4EDE2A1F.9010902@tds.net>
Message-ID: <4EDECF1C.5000508@yahoo.com.cn>

On 12/06/2011 10:43 PM, Jason Bacon wrote:

> If your head node is acting as a gateway, you have to be sure that the
> hostname used by Torque is bound to the primary interface

What's the meaning of *primary* interface? Do you mean the interface via which the cluster will be accessed remotely from anywhere on the internet?
Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Tue Dec 6 19:42:24 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Wed, 07 Dec 2011 10:42:24 +0800
Subject: [torqueusers] Cann't generate the PBS log files.
In-Reply-To: <4EDE2A1F.9010902@tds.net>
References: <4ED86182.2000202@yahoo.com.cn> <13B3EC65-9E34-4267-B7E3-22A6468BC033@ldeo.columbia.edu> <4EDDD5F2.7010507@yahoo.com.cn> <71F9B327-ED76-467F-9F09-E76D6241C5D2@ldeo.columbia.edu> <4EDE2A1F.9010902@tds.net>
Message-ID: <4EDED290.9040909@yahoo.com.cn>

On 12/06/2011 10:43 PM, Jason Bacon wrote:

> Head node/gateway:
>
> 129.89.25.224 peregrine.hpc.uwm.edu peregrine
>
> # Local names and addresses
> 192.168.0.3 compute-01.local compute-01
> 192.168.0.4 compute-02.local compute-02

Why not keep the numbers in the names consistent with the IP addresses? Say, use the following naming rule:

192.168.0.3 compute-03.local compute-03
192.168.0.4 compute-04.local compute-04

>
> Compute nodes:
>
> 192.168.0.2 peregrine.hpc.uwm.edu peregrine

In this case, the *Head node* has two interfaces, one bound to 192.168.0.2 and the other to 129.89.25.224. The former is for internal access and the latter is for external access from anywhere, i.e., you use this node as the gateway. But it seems that you also use this node as a compute node, so what about changing the above line into the following one for convenience:

192.168.0.2 compute-02.local compute-02 peregrine

Any hints on my above suggestions will be highly appreciated ;-)

>
> # Local names and addresses
> 192.168.0.3 compute-01.local compute-01
> 192.168.0.4 compute-02.local compute-02

Why do you write the above three lines once again?
Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Tue Dec 6 20:20:57 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Wed, 07 Dec 2011 11:20:57 +0800
Subject: [torqueusers] The issue when runing dmol3.
In-Reply-To:
References: <4EDDF3C8.4030103@yahoo.com.cn>
Message-ID: <4EDEDB99.6040900@yahoo.com.cn>

On 12/06/2011 10:10 PM, Gustavo Correa wrote:

> Hi Hongsheng
>
> It could be many things.
> My guess is that /public/home is a directory shared via NFS,

Yes.

> maybe physically located in node32, and mounted on all nodes.

The physical location is node31:/public. See the following mount settings used by NFS in my case:

NFSDIR node31 /public /public

> Maybe a few nodes are not mounting it correctly.
> You could login to each node and make sure it is being mounted.
> The /etc/auto.* files are the first thing to look at.

I can ssh to each of these nodes without a password.

As for the /etc/auto.* files you mentioned above, I can't find any clue in them for my issue. See the following for details:

----------
zhaohongsheng at node32:~> ls /etc/auto.*
/etc/auto.master /etc/auto.misc /etc/auto.net /etc/auto.smb
zhaohongsheng at node32:~> cat /etc/auto.*
#
# $Id: auto.master,v 1.4 2005/01/04 14:36:54 raven Exp $
#
# Sample auto.master file
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# For details of the format look at autofs(5).
#/misc /etc/auto.misc --timeout=60
#/smb /etc/auto.smb
#/misc /etc/auto.misc
#/net /etc/auto.net
#
# $Id: auto.misc,v 1.2 2003/09/29 08:22:35 raven Exp $
#
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# Details may be found in the autofs(5) manpage

cd -fstype=iso9660,ro,nosuid,nodev :/dev/cdrom

# the following entries are samples to pique your imagination
#linux -ro,soft,intr ftp.example.org:/pub/linux
#boot -fstype=ext2 :/dev/hda1
#floppy -fstype=auto :/dev/fd0
#floppy -fstype=ext2 :/dev/fd0
#e2floppy -fstype=ext2 :/dev/fd0
#jaz -fstype=ext2 :/dev/sdc1
#removable -fstype=ext2 :/dev/hdd
#!/bin/bash

# $Id: auto.net,v 1.8 2005/04/05 13:02:09 raven Exp $

# This file must be executable to work! chmod 755!

# Look at what a host is exporting to determine what we can mount.
# This is very simple, but it appears to work surprisingly well

key="$1"

. /etc/sysconfig/autofs
opts="$AUTO_NET_FLAGS -fstype=nfs,hard,intr,nodev,nosuid"

# Showmount comes in a number of names and varieties. "showmount" is
# typically an older version which accepts the '--no-headers' flag
# but ignores it. "kshowmount" is the newer version installed with knfsd,
# which both accepts and acts on the '--no-headers' flag.

#SHOWMOUNT="kshowmount --no-headers -e $key"
#SHOWMOUNT="showmount -e $key | tail -n +2"

for P in /bin /sbin /usr/bin /usr/sbin
do
    for M in showmount kshowmount
    do
        if [ -x $P/$M ]
        then
            SMNT=$P/$M
            break
        fi
    done
done

[ -x $SMNT ] || exit 1

# Newer distributions get this right
SHOWMOUNT="$SMNT --no-headers -e $key"

$SHOWMOUNT | LC_ALL=C sort +0 | \
    awk -v key="$key" -v opts="$opts" -- '
    BEGIN { ORS=""; first=1 }
          { if (first) { print opts; first=0 }; print " \\\n\t" $1, key ":" $1 }
    END   { if (!first) print "\n"; else exit 1 }
    '
#!/bin/bash

# $Id: auto.smb,v 1.3 2005/04/05 13:02:09 raven Exp $

# This file must be executable to work! chmod 755!
key="$1"
opts="-fstype=smbfs"

for P in /bin /sbin /usr/bin /usr/sbin
do
    if [ -x $P/smbclient ]
    then
        SMBCLIENT=$P/smbclient
        break
    fi
done

[ -x $SMBCLIENT ] || exit 1

$SMBCLIENT -gNL $key 2>/dev/null| awk -v key="$key" -v opts="$opts" -F'|' -- '
    BEGIN  { ORS=""; first=1 }
    /Disk/ { if (first) { print opts; first=0 }; print " \\\n\t /" $2, "://" key "/" $2 }
    END    { if (!first) print "\n"; else exit 1 }
    '
zhaohongsheng at node32:~>
-----------

Best regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From zhaohscas at yahoo.com.cn Tue Dec 6 20:25:05 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Wed, 07 Dec 2011 11:25:05 +0800
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To:
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn> <4EDDCCE4.8080602@yahoo.com.cn>
Message-ID: <4EDEDC91.2050007@yahoo.com.cn>

On 12/06/2011 09:02 PM, Jan Kasiak wrote:

> Hi,
>
> Thats not actually true. I think half the cores are hyperthreaded. Look
> up your processor on Intel's website for exact core count. cpuinfo
> reports logical and not physical cores. It also looks like you have a
> dual socket node.
>

Do you mean that np should be equal to the number of physical cores for a specific node?
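
On the logical-versus-physical distinction quoted above, here is a minimal sketch of how the two counts can be told apart. It assumes an x86 Linux node whose /proc/cpuinfo carries `processor`, `physical id` and `core id` fields (other architectures may omit them); the function name `count_cores` and the sample text are illustrative only, not something from this thread.

```python
def count_cores(cpuinfo_text):
    """Return (logical, physical) core counts from /proc/cpuinfo-style text."""
    logical = 0
    physical = set()          # distinct (socket, core) pairs
    socket = core = None
    # The extra "" guarantees the last stanza is closed even without a trailing blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends the stanza of one logical CPU.
            if socket is not None and core is not None:
                physical.add((socket, core))
            socket = core = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            socket = value
        elif key == "core id":
            core = value
    return logical, len(physical)


# Illustrative input (assumption): one socket, two cores, two hardware threads per core.
SAMPLE = "\n\n".join(
    "processor : %d\nphysical id : 0\ncore id : %d" % (cpu, cpu % 2)
    for cpu in range(4)
)

if __name__ == "__main__":
    print(count_cores(SAMPLE))
```

On a real node the same function could be fed `open("/proc/cpuinfo").read()`; which of the two counts to use for np is the policy question discussed in this thread.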
Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From samuel at unimelb.edu.au Tue Dec 6 20:43:57 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 07 Dec 2011 14:43:57 +1100
Subject: [torqueusers] specific nodes
In-Reply-To:
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> <1BAB6393-963A-4A58-8223-318FAFC5AF99@ldeo.columbia.edu> <4EDC4920.2080006@unimelb.edu.au>
Message-ID: <4EDEE0FD.8020803@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/12/11 09:41, Ricardo Román Brenes wrote:

> with those, there's no need to build OSC mpiexec...

Does MPICH2 include support for TM now ?

If not, then you still would benefit from it.

- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7e4P0ACgkQO2KABBYQAh+nkQCfUYSWB2Z3zwNR/NT0xxyVLy04
4pIAni+g3t/cvWSYBfzSuMiZtICImDbZ
=itCi
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Tue Dec 6 21:03:25 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 07 Dec 2011 15:03:25 +1100
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <4EDEDC91.2050007@yahoo.com.cn>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn> <4EDDCCE4.8080602@yahoo.com.cn> <4EDEDC91.2050007@yahoo.com.cn>
Message-ID: <4EDEE58D.3010204@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/12/11 14:25, Hongsheng Zhao wrote:

> Do you mean the np should be equal to the number of the physical cores
> for specific node?

You probably want to disable hyperthreading if you are concerned that it will impact performance. That way you'll only see the physical CPUs, and if you run with cpusets support enabled you'll only be binding to physical cores and not the SMT ones.

cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7e5Y0ACgkQO2KABBYQAh/ZfACeOpnIhhTc9QYFowqBBkh65Db1
vwEAoIpO/6yH9SieuD3Ayxmp4SeIlSUC
=OIpw
-----END PGP SIGNATURE-----

From mailmaverick666 at gmail.com Wed Dec 7 00:52:18 2011
From: mailmaverick666 at gmail.com (rishi pathak)
Date: Wed, 7 Dec 2011 13:22:18 +0530
Subject: [torqueusers] syntax for requesting 10 nodes with 1 core on each node exclusively
In-Reply-To:
References:
Message-ID:

Torque won't allow you to do that.
-l nodes=X:ppn=1 is equivalent to -l nodes=X
A solution would be to set ppn to 2 and, when executing the MPI program (assuming it's MPI), ask the process manager to start 1 process per node.

On Tue, Dec 6, 2011 at 8:46 PM, Shaomin Hu wrote:

> We are using Torque v3.0.2 and want to run a job exclusively on 10
> different nodes, 1 core on each node.
> There are 16 cores on each node. If
> we use the syntax "-lnodes=10:ppn=1 -n", the job still only get one node
> exclusively to run, not 10 different nodes.
>
> What syntax should we use to get 10 different nodes, 1 core on each node
> exclusively?
>
> Thanks much.

--
---
Rishi Pathak
National PARAM Supercomputing Facility
C-DAC, Pune, India

From tiago.silva at cefas.co.uk Wed Dec 7 02:30:21 2011
From: tiago.silva at cefas.co.uk (Tiago Silva (Cefas))
Date: Wed, 7 Dec 2011 09:30:21 -0000
Subject: [torqueusers] hydra *and* mpd
In-Reply-To: <04A370231C10664C88B28D1EF74F487903360BCF@LOWEXPRESS.corp.cefas.co.uk>
References: <04A370231C10664C88B28D1EF74F487903360BC8@LOWEXPRESS.corp.cefas.co.uk> <04A370231C10664C88B28D1EF74F487903360BCF@LOWEXPRESS.corp.cefas.co.uk>
Message-ID: <04A370231C10664C88B28D1EF74F487903360BD5@LOWEXPRESS.corp.cefas.co.uk>

I have read about how to submit jobs using OSC's mpiexec and with mpd.
Simple question: can I mix them and still have all the jobs accounted for by Torque?

tiago

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tiago Silva (Cefas)
Sent: 06 December 2011 14:30
To: torqueusers at supercluster.org
Subject: [torqueusers] hydra *and* mpd

Hi

We have a 20 node cluster with rocks 5.3 and mpich2 1.3.1. Most users use mpiexec with hydra, but one of our models requires a second version of mpich that was compiled differently and we submit these jobs using mpirun and an mpd ring.

We have rocks 5.3 but due to lack of foresight torque wasn't installed when the cluster was built.
We are planning to install torque on top of the existing installation. Would torque be able to handle two different initialization methods (mpirun and mpiexec)?

Thanks,
Tiago

From Gareth.Williams at csiro.au Wed Dec 7 02:45:46 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Wed, 7 Dec 2011 20:45:46 +1100
Subject: [torqueusers] syntax for requesting 10 nodes with 1 core on each node exclusively
In-Reply-To:
References:
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102C6360AE3@exvic-mbx04.nexus.csiro.au>

That is not true without qualification (maybe I could state that better). Whether the two are equivalent depends on the scheduler and its configuration. Others have already posted useful answers.

Sorry for the non plain text top post :)

Gareth

From: rishi pathak [mailto:mailmaverick666 at gmail.com]
Sent: Wednesday, 7 December 2011 6:52 PM
To: hu8 at purdue.edu; Torque Users Mailing List
Subject: Re: [torqueusers] syntax for requesting 10 nodes with 1 core on each node exclusively

Torque wont allow you to do that.
-l nodes=X:ppn=1 is equivalent to -l nodes=X
Solution would be to set ppn to 2 and while execution of MPI program(assuming its MPI), ask process manager to start 1 process per node.
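
The "one process per node" idea quoted above can be sketched as follows. This is only an illustration, assuming the job's node file (the file named by $PBS_NODEFILE, which the job script earlier in this archive prints) lists one hostname per allocated slot, so that a host appears once for every core granted on it. The helper name `one_host_per_node` and the sample hostnames are hypothetical.

```python
def one_host_per_node(nodefile_lines):
    """Collapse a per-slot host list to one entry per distinct host, keeping order."""
    hosts = []
    seen = set()
    for line in nodefile_lines:
        host = line.strip()
        if host and host not in seen:   # skip blank lines and repeated hosts
            seen.add(host)
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    # Hypothetical allocation: two slots on each of two nodes.
    sample = ["node-a\n", "node-a\n", "node-b\n", "node-b\n"]
    print(one_host_per_node(sample))
```

The resulting list could be written to a machine file and handed to whatever launcher is in use; the exact option for that differs between MPI implementations and is not shown here.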
On Tue, Dec 6, 2011 at 8:46 PM, Shaomin Hu wrote:

We are using Torque v3.0.2 and want to run a job exclusively on 10 different nodes, 1 core on each node. There are 16 cores on each node. If we use the syntax "-lnodes=10:ppn=1 -n", the job still only get one node exclusively to run, not 10 different nodes.

What syntax should we use to get 10 different nodes, 1 core on each node exclusively?

Thanks much.

--
---
Rishi Pathak
National PARAM Supercomputing Facility
C-DAC, Pune, India

From govind.rhul at gmail.com Wed Dec 7 08:06:58 2011
From: govind.rhul at gmail.com (Govind B. Songara)
Date: Wed, 7 Dec 2011 15:06:58 +0000
Subject: [torqueusers] max jobs per group per node
Message-ID:

Hi,

I am looking for an option in torque/maui to configure max jobs per group per node.
Can someone please give me some ideas?

Thanks
Govind

From gus at ldeo.columbia.edu Wed Dec 7 10:34:23 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 7 Dec 2011 12:34:23 -0500
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To: <4EDEDC91.2050007@yahoo.com.cn>
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn> <4EDDCCE4.8080602@yahoo.com.cn> <4EDEDC91.2050007@yahoo.com.cn>
Message-ID:

On Dec 6, 2011, at 10:25 PM, Hongsheng Zhao wrote:

> On 12/06/2011 09:02 PM, Jan Kasiak wrote:
>> Hi,
>>
>> Thats not actually true. I think half the cores are hyperthreaded. Look
>> up your processor on Intel's website for exact core count. cpuinfo
>> reports logical and not physical cores. It also looks like you have a
>> dual socket node.
>>
>
> Do you mean the np should be equal to the number of the physical cores
> for specific node?

Hi Hongsheng

As far as I know, this is debatable.
Some parallel programs don't scale well when hyperthreading is turned on, but other programs do well.
Scaling is seldom what you would expect with physical cores, i.e., if you have 8 physical cores that appear as 16 because of hyperthreading, when you switch your program from using 8 to 16 processes, the speedup is not a factor of 2, but often significantly less.
However, a factor of 1.2 may still be a very good thing.

There is some hassle in managing a hyperthreaded node when several jobs share the node.
So, in this case, to make things easy from the management standpoint at least, you may choose to turn off hyperthreading, which is typically done in the BIOS settings.

Two years ago I ran some jobs on an IBM machine where hyperthreading [which IBM calls simultaneous multithreading or SMT] could be turned on or off at runtime by the MPI job.
Each node had 32 physical cores, which could look like 64 by just setting an environment variable in the job script.
Speedups in the specific program I ran [a climate general circulation model] when going from 32 physical cores to 64 SMT cores was more like 1.2-1.3 than 2. However, this was still very good, specially considering that the lab charges were based on full node utilization per hour, regardless of SMT being on or off. Turning on SMT at run time was nice, but I don't think this is [yet] feasible with hyperthreading in Linux. [I may be wrong about this, and somebody more knowledgeable on these matters in the list could clarify the point.] My two cents, Gus Correa > > Regards > -- > Hongsheng Zhao > School of Physics and Electrical Information Science, > Ningxia University, Yinchuan 750021, China > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Wed Dec 7 10:41:34 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 7 Dec 2011 12:41:34 -0500 Subject: [torqueusers] hydra *and* mpd In-Reply-To: <04A370231C10664C88B28D1EF74F487903360BD5@LOWEXPRESS.corp.cefas.co.uk> References: <04A370231C10664C88B28D1EF74F487903360BC8@LOWEXPRESS.corp.cefas.co.uk> <04A370231C10664C88B28D1EF74F487903360BCF@LOWEXPRESS.corp.cefas.co.uk> <04A370231C10664C88B28D1EF74F487903360BD5@LOWEXPRESS.corp.cefas.co.uk> Message-ID: Hi Tiago Use OSC mpiexec alone. It works beautifully with Torque. No need for mpd in this case, and it may actually only mess things up. I would shut down all mpd rings that may be there, and forget about mpd. Actually, mpd was deprecated by the MPICH2 development team, AFAIK. Make sure you point to the OSC mpiexec in the command line, say, by using full path name to it, to avoid the risk of mistakenly using the mpiexec that came with MPICH2. My two cents, Gus Correa On Dec 7, 2011, at 4:30 AM, Tiago Silva (Cefas) wrote: > I have read about how to submit jobs using OSC?s mpiexec and with mpd. 
> Simple question, can I mix them, and will all the jobs be accounted for by torque?
>
> tiago
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tiago Silva (Cefas)
> Sent: 06 December 2011 14:30
> To: torqueusers at supercluster.org
> Subject: [torqueusers] hydra *and* mpd
>
> Hi
>
> We have a 20 node cluster with rocks 5.3 and mpich2 1.3.1. Most users use
> mpiexec with hydra, but one of our models requires a second version of mpich
> that was compiled differently and we submit these jobs using mpirun and an
> mpd ring.
>
> We have rocks 5.3 but due to lack of foresight torque wasn't installed when
> the cluster was built. We are planning to install torque on top of the
> existing installation. Would torque be able to handle two different
> initialization methods (mpirun and mpiexec)?
>
> Thanks,
> Tiago
From roman.ricardo at gmail.com Wed Dec 7 11:01:11 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Wed, 7 Dec 2011 12:01:11 -0600
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu>
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu>
Message-ID:

For Gus =) :
There is no error message; the job just stays in state Q.

[root at zarate-0 ~]# cat /var/spool/pbs/server_priv/nodes
zarate-0 np=2 uno
zarate-1 np=2 uno
zarate-2 np=2 dos
zarate-3 np=2 dos

[root at zarate-0 ~]# pbsnodes
zarate-0
     state = free
     np = 2
     properties = uno
     ntype = cluster
     status = opsys=linux,uname=Linux zarate-0 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=1601,nsessions=1,nusers=1,idletime=149,totmem=736568kb,availmem=628092kb,physmem=212288kb,ncpus=2,loadave=0.02,gres=,netload=51327373,state=free,jobs=,varattr=,rectime=1323279946

zarate-1
     state = free
     np = 2
     properties = uno
     ntype = cluster
     status = opsys=linux,uname=Linux zarate-1 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=85536,totmem=730040kb,availmem=650168kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=17064175,state=free,jobs=,varattr=,rectime=1323279949

zarate-2
     state = free
     np = 2
     properties = dos
     ntype = cluster
     status = opsys=linux,uname=Linux zarate-2 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=85527,totmem=737208kb,availmem=659320kb,physmem=212288kb,ncpus=2,loadave=0.04,gres=,netload=26043036,state=free,jobs=,varattr=,rectime=1323279953

zarate-3
     state = free
     np = 2
     properties = dos
     ntype = cluster
     status = opsys=linux,uname=Linux zarate-3 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=85538,totmem=737208kb,availmem=657236kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=58004536,state=free,jobs=,varattr=,rectime=1323279969

For Steve:

This is the checkjob output:
[rroman at zarate-0:~/outputs]$ checkjob 104

checking job 104

State: Running
Creds: user:rroman group:usuariosCluster class:dos qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Dec 7 11:46:03
  (Time Queued Total: 00:18:11 Eligible: 00:18:11)

StartTime: Wed Dec 7 12:04:14
Total Tasks: 4

Req[0] TaskCount: 4 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [dos]
Allocated Nodes:
[zarate-3:2][zarate-2:2]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1090
PartitionMask: [ALL]
Flags: RESTARTABLE

Reservation '104' (00:00:00 -> 1:00:00 Duration: 1:00:00)
PE: 4.00 StartPriority: 18

It says the job is RUNNING, but qstat shows it as queued (Q):
[rroman at zarate-0:~/outputs]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
104.zarate-0              a.t              rroman                 0 Q dos

Also, I discovered with qstat -fl that:
[rroman at zarate-0:~/outputs]$ qstat -fl
Job Id: 104.zarate-0
    Job_Name = a.t
    Job_Owner = rroman at zarate-0
    job_state = Q
    queue = dos
    server = zarate-0
    Checkpoint = u
    ctime = Wed Dec 7 11:46:03 2011
    Error_Path = zarate-0:/home/rroman/outputs/a.t.e104
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Dec 7 12:07:02 2011
    Output_Path = zarate-0:/home/rroman/outputs/a.t.o104
    Priority = 0
    qtime = Wed Dec 7 11:46:03 2011
    Rerunable = True
    Resource_List.nodect = 4
    Resource_List.nodes = 4
    Resource_List.walltime = 01:00:00
    Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8,
        PBS_O_LOGNAME=rroman,
        PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/
        usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local
        /maui/bin:/usr/local/maui/sbin:/home/rroman/bin,
        PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash,
        PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0,
        PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos
    etime = Wed Dec 7 11:46:03 2011
    exit_status = -3
    submit_args = ../a.t
    start_time = Wed Dec 7 12:07:02 2011
    Walltime.Remaining = 360
    start_count = 1258
    fault_tolerant = False

The job has an exit status of -3 (which is "JOB_EXEC_RETRY -3 job execution
failed, do retry"), but I don't have a clue why that happened, or why Maui and
Torque report different states (running, queued, and exited with -3).

While performing some tests I realized that the problem is in zarate-2. If I
run the same job with 1 or 2 nodes (assigned to zarate-3) it works. But when
jobs have to be assigned to zarate-2 I get this behavior.

From Dominik.Epple at emea.nec.com Wed Dec 7 06:01:09 2011
From: Dominik.Epple at emea.nec.com (Dominik Epple)
Date: Wed, 7 Dec 2011 13:01:09 +0000
Subject: [torqueusers] Torque 4 beta: build issues
Message-ID: <3B1DDC6D63ABC24BBF3ED342DEC683252121BB@EX10MBX01.EU.NEC.COM>

Hello,

I just wanted to try out the torque 4 beta. I encountered some build issues,
but first I want to ask whether I am using the correct download, because the
answer may make the following question obsolete. So: where can I download
the torque 4 beta? I found
http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-4.0_lanl.tar.gz
. Is this the correct download?

Then, when I use this download to build RPMs, some minor (mostly trivial)
adjustments in the specfile are required to build it. But the following
error is not so trivial. When enabling drmaa, rpmbuild fails, and the last
few lines of its output are given below.

make[3]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/pam'
make[2]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/pam'
Making install in drmaa
make[2]: Entering directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/drmaa'
Making install in src
make[3]: Entering directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/drmaa/src'
make install-am
make[4]: Entering directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/drmaa/src'
make[5]: Entering directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/drmaa/src'
test -z "/usr/lib64" || /bin/mkdir -p "/root/rpmbuild/BUILDROOT/torque-4.0_lanl-1.cri.x86_64/usr/lib64"
 /bin/sh ../../../libtool --mode=install /usr/bin/install -c libdrmaa.la '/root/rpmbuild/BUILDROOT/torque-4.0_lanl-1.cri.x86_64/usr/lib64'
libtool: install: error: cannot install `libdrmaa.la' to a directory not ending in /usr/local/lib
make[5]: *** [install-libLTLIBRARIES] Error 1
make[5]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/drmaa/src'
make[4]: *** [install-am] Error 2
make[4]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/drmaa/src'
make[3]: *** [install] Error 2
make[3]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/drmaa/src'
make[2]: *** [install-recursive] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src/drmaa'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0_lanl/src'
make: *** [install-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.Qwd53N (%install)

RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.Qwd53N (%install)

So we have a libtool problem that seems to have shown up in a similar
fashion in other projects from time to time for many years, but I was
unable to learn from my Google hits what the reason for this error is and
how to fix it.

Regards
Dominik

From cwebberops at gmail.com Wed Dec 7 11:18:15 2011
From: cwebberops at gmail.com (Christopher Webber)
Date: Wed, 7 Dec 2011 10:18:15 -0800
Subject: [torqueusers] Node Maintenance
Message-ID:

Using the torque scheduler, is there a way to stop new jobs from being
placed on a node while allowing current jobs there to finish, so that
maintenance can be performed on the node?

-- cwebber

Christopher Webber - Systems Administrator
Bioinformatics - University of California, Riverside

Twitter: @cwebber
Tel: 951.867.7108
http://cwebber.ucr.edu

From knielson at adaptivecomputing.com Wed Dec 7 11:21:36 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 07 Dec 2011 11:21:36 -0700 (MST)
Subject: [torqueusers] Torque 4 beta: build issues
In-Reply-To: <3B1DDC6D63ABC24BBF3ED342DEC683252121BB@EX10MBX01.EU.NEC.COM>
Message-ID:

----- Original Message -----
> From: "Dominik Epple"
> To: torqueusers at supercluster.org
> Sent: Wednesday, December 7, 2011 6:01:09 AM
> Subject: [torqueusers] Torque 4 beta: build issues
>
> Hello,
>
> I just wanted to try out the torque 4 beta. I encounter some build
> issues, but first I want to ask whether I use the correct download,
> because this perhaps makes the following question obsolete. So:
> where to download the torque 4 beta? I found
> http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-4.0_lanl.tar.gz
> . Is this the correct download?
>

Dominik,

This is the wrong download. We have not posted an official tar ball yet.
However, you can use subversion and download using the following syntax:

svn co svn://clusterresources.com/torque/trunk

We are still making several changes but welcome feedback, especially
concerning the build process.

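On the libtool error quoted in the original message: libtool refuses to
install a .la file when the destination directory does not end with the
libdir recorded inside that file. So one likely cause (an assumption, not a
confirmed diagnosis of this source tree) is that src/drmaa was configured
with the default /usr/local prefix while the RPM installs into /usr/lib64.
A minimal sketch for checking that; the helper names la_libdir and
dest_matches_libdir are made up for illustration:

```shell
#!/bin/sh
# Print the libdir recorded in a libtool .la file
# (the line looks like: libdir='/usr/local/lib').
la_libdir() {
    sed -n "s/^libdir='\(.*\)'\$/\1/p" "$1"
}

# Succeed if the install directory ($1) ends with the recorded libdir ($2),
# which is the condition the libtool error message describes.
dest_matches_libdir() {
    case "$1" in
        *"$2") return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example, running la_libdir on src/drmaa/src/libdrmaa.la in the build
tree would show which libdir that subdirectory was configured with; if it
prints /usr/local/lib, the drmaa subdirectory probably did not pick up the
libdir used by the rest of the build.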
Regards

Ken Nielson
Adaptive Computing

From gus at ldeo.columbia.edu Wed Dec 7 11:37:11 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 7 Dec 2011 13:37:11 -0500
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu>
Message-ID: <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu>

Hi Ricardo

You set only two nodes of type 'uno' and another two of type 'dos'.
So, how come in your PBS script you request four nodes of each type?
[You stripped off your PBS script from this email, it was in a previous
message, and it had the line "#PBS -l nodes=4".]

If I remember right, your goal was to direct the jobs to the right type of
node, correct?
However, you must have the resources that you are requesting.
If you ask for more than you have, you won't get it.

My suggestion is:
Try one job with:

#PBS -q uno
#PBS -l nodes=2:ppn=2
...
mpiexec -np 4 ...

Try another job with:

#PBS -q dos
#PBS -l nodes=2:ppn=2
...
mpiexec -np 4 ...

If you also want to run jobs on four nodes, create a third queue, say
'unoedos', and do *not* set the resources_default.neednodes on that one,
so that its jobs can run on any type of node.
Then try a job:

#PBS -q unoedos
#PBS -l nodes=4:ppn=2
...
mpiexec -np 8 ...

On which nodes do you expect each of the three jobs above to run?
Does it make sense to you?
Well, unless zarate-2 has a problem, of course,
perhaps a misinstalled pbs_mom or something else.

I hope this helps ... and that it works ...
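
A minimal sketch of the first job script above written out in full,
assuming a POSIX shell and an MPI launcher called mpiexec that accepts -n.
The program name ./my_mpi_program, the walltime, and the helper count_slots
are placeholders for illustration, not something taken from Ricardo's setup.
It derives the process count from $PBS_NODEFILE, which Torque fills with one
line per allocated processor slot, so the mpiexec count cannot drift away
from the nodes/ppn request:

```shell
#!/bin/sh
#PBS -q uno
#PBS -l nodes=2:ppn=2
#PBS -l walltime=01:00:00

# Count the processor slots Torque allocated: the node file has one
# non-empty line per slot, e.g. 4 lines for nodes=2:ppn=2.
count_slots() {
    grep -c . "$1"
}

# Only launch when actually running inside a Torque job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    NP=$(count_slots "$PBS_NODEFILE")
    # ./my_mpi_program is a hypothetical executable name.
    mpiexec -n "$NP" ./my_mpi_program
fi
```

The same skeleton should work for the 'dos' and 'unoedos' jobs by changing
only the two #PBS lines.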
:)

Good luck,
Gus Correa

From gus at ldeo.columbia.edu Wed Dec 7 11:39:59 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 7 Dec 2011 13:39:59 -0500
Subject: [torqueusers] Node Maintenance
In-Reply-To:
References:
Message-ID: <0F6C7536-CE22-4ECA-B7FB-86A5F70157DC@ldeo.columbia.edu>

Hi Christopher

pbsnodes -o node_name

It will put the node offline, preventing new jobs from being scheduled on
it, but allowing the current jobs to finish.

For more detail check 'man pbsnodes'.

I hope it helps,
Gus Correa

On Dec 7, 2011, at 1:18 PM, Christopher Webber wrote:

> Using the torque scheduler, is there a way to stop new jobs from being placed on a node but allowing current jobs there to finish so a node can have maintenance performed on it?
> > -- cwebber > > Christopher Webber - Systems Administrator > Bioinformatics - University of California, Riverside > > Twitter: @cwebber > Tel: 951.867.7108 > http://cwebber.ucr.edu > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From roman.ricardo at gmail.com Wed Dec 7 13:21:33 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 7 Dec 2011 14:21:33 -0600 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> Message-ID: HI thanks for replying so fast =) In the script when i ask for 2 nodes, those 2 nodes are on hte same machine. dont ask me why but i think tis because the nodes files has np=2; when i ask for 3 nodes i get 2 times zarate-0 and 1 time zarate-1. If i ask for 4 nodes i get 2 times each node. And thanks for pointing out that joined queue! =) i'm going to check the sanity of zarate-2 :P On Wed, Dec 7, 2011 at 12:37 PM, Gustavo Correa wrote: > Hi Ricardo > > You set only two nodes of type 'uno' and another two of type 'dos' > So, how come that in your PBS script you request four nodes of each type? > [You stripped off your PBS script from this email, it was in a previous > message, > and it had the line "#PBS -l nodes=4" > > If I remember right your goal was to direct the jobs to the right type of > node, correct? > However, you must have the resources that you are requesting. > If you ask for more than you have, you won't get it. > > My suggestion is. > Try one job with: > > #PBS -q uno > #PBS -l nodes=2:ppn=2 > ... > mpiexec -np=4 ... > > Try another job with: > > #PBS -q dos > #PBS -l nodes=2:ppn=2 > ... > mpiexec -np=4 ... 
> > If you also want to run jobs on four nodes, create a third queue, say > 'unoedos', > and do *not* set the resources_default.neednodes on that one, so that any > type > of node will go on that one. > Then try a job: > > #PBS -q unoedos > #PBS -l nodes=4:ppn=2 > ... > mpiexec -np=8 ... > > On which nodes do you expect each of the three jobs above to run on? > Does it make sense for you? > Well, save the case that zarate-2 has a problem, of course, > perhaps misinstalled pbs_mom or something else. > > I hope this helps ... and that it works ... :) > > Good luck, > Gus Correa > > On Dec 7, 2011, at 1:01 PM, Ricardo Rom?n Brenes wrote: > > > For gus =) : > > There is no error message it just stays in a status Q > > > > [root at zarate-0 ~]# cat /var/spool/pbs/server_priv/nodes > > zarate-0 np=2 uno > > zarate-1 np=2 uno > > zarate-2 np=2 dos > > zarate-3 np=2 dos > > > > > > [root at zarate-0 ~]# pbsnodes > > zarate-0 > > state = free > > np = 2 > > properties = uno > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-0 2.6.29.4-167.fc11.ppc64 > #1 SMP Wed May 27 17:18:17 EDT 2009 > ppc64,sessions=1601,nsessions=1,nusers=1,idletime=149,totmem=736568kb,availmem=628092kb,physmem=212288kb,ncpus=2,loadave=0.02,gres=,netload=51327373,state=free,jobs=,varattr=,rectime=1323279946 > > > > zarate-1 > > state = free > > np = 2 > > properties = uno > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-1 2.6.29.4-167.fc11.ppc64 > #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? > 0,nusers=0,idletime=85536,totmem=730040kb,availmem=650168kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=17064175,state=free,jobs=,varattr=,rectime=1323279949 > > > > zarate-2 > > state = free > > np = 2 > > properties = dos > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-2 2.6.29.4-167.fc11.ppc64 > #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 
> 0,nusers=0,idletime=85527,totmem=737208kb,availmem=659320kb,physmem=212288kb,ncpus=2,loadave=0.04,gres=,netload=26043036,state=free,jobs=,varattr=,rectime=1323279953 > > > > zarate-3 > > state = free > > np = 2 > > properties = dos > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-3 2.6.29.4-167.fc11.ppc64 > #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? > 0,nusers=0,idletime=85538,totmem=737208kb,availmem=657236kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=58004536,state=free,jobs=,varattr=,rectime=1323279969 > > > > > > > > For Steve: > > > > this is hte checkjob: > > [rroman at zarate-0:~/outputs]$ checkjob 104 > > > > > > checking job 104 > > > > State: Running > > Creds: user:rroman group:usuariosCluster class:dos qos:DEFAULT > > WallTime: 00:00:00 of 1:00:00 > > SubmitTime: Wed Dec 7 11:46:03 > > (Time Queued Total: 00:18:11 Eligible: 00:18:11) > > > > StartTime: Wed Dec 7 12:04:14 > > Total Tasks: 4 > > > > Req[0] TaskCount: 4 Partition: DEFAULT > > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 > > Opsys: [NONE] Arch: [NONE] Features: [dos] > > Allocated Nodes: > > [zarate-3:2][zarate-2:2] > > > > > > IWD: [NONE] Executable: [NONE] > > Bypass: 0 StartCount: 1090 > > PartitionMask: [ALL] > > Flags: RESTARTABLE > > > > Reservation '104' (00:00:00 -> 1:00:00 Duration: 1:00:00) > > PE: 4.00 StartPriority: 18 > > > > it says its RUNNING! but on the qstat the shows to be QUEUE... 
> > > > [rroman at zarate-0:~/outputs]$ qstat > > Job id Name User Time Use S > Queue > > ------------------------- ---------------- --------------- -------- - > ----- > > 104.zarate-0 a.t rroman 0 Q > dos > > > > Also i discovered with qstat -fl that: > > [rroman at zarate-0:~/outputs]$ qstat -fl > > Job Id: 104.zarate-0 > > Job_Name = a.t > > Job_Owner = rroman at zarate-0 > > job_state = Q > > queue = dos > > server = zarate-0 > > Checkpoint = u > > ctime = Wed Dec 7 11:46:03 2011 > > Error_Path = zarate-0:/home/rroman/outputs/a.t.e104 > > Hold_Types = n > > Join_Path = n > > Keep_Files = n > > Mail_Points = a > > mtime = Wed Dec 7 12:07:02 2011 > > Output_Path = zarate-0:/home/rroman/outputs/a.t.o104 > > Priority = 0 > > qtime = Wed Dec 7 11:46:03 2011 > > Rerunable = True > > Resource_List.nodect = 4 > > Resource_List.nodes = 4 > > Resource_List.walltime = 01:00:00 > > Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, > > PBS_O_LOGNAME=rroman, > > > PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ > > > usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local > > /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, > > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, > > > PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, > > PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos > > etime = Wed Dec 7 11:46:03 2011 > > exit_status = -3 > > submit_args = ../a.t > > start_time = Wed Dec 7 12:07:02 2011 > > Walltime.Remaining = 360 > > start_count = 1258 > > fault_tolerant = False > > > > > > The job has an exit status of -3 (which is "JOB_EXEC_RETRY -3 job > execution failed, do retry) but i dont have a clue why that happened and > why Maui and Torque report different status (running, queued and exit with > -3). > > > > > > > > While performing some test i realized that the problem is in zarate-2. > If i run the same job with 1 or 2 nodes (assiged to zarate-3) it works. 
But > when zarate-2 has to be assigned jobs i get this behavior. > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/c158b9c3/attachment.html From glen.beane at gmail.com Wed Dec 7 13:37:58 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Wed, 7 Dec 2011 15:37:58 -0500 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> Message-ID: Sent from my iPhone On Dec 7, 2011, at 3:21 PM, Ricardo Rom?n Brenes wrote: > HI thanks for replying so fast =) > > In the script when i ask for 2 nodes, those 2 nodes are on hte same machine. dont ask me why but i think tis because the nodes files has np=2; when i ask for 3 nodes i get 2 times zarate-0 and 1 time zarate-1. If i ask for 4 nodes i get 2 times each node. > This is because of your Maui config, as we have mentioned in previous threads. Setting JOBNODEMATCHPOLICY to EXACTNODE will keep Maui from packing your job onto fewer nodes than it requests ( the default is EXACTPROC) > And thanks for pointing out that joined queue! =) i'm going to check the sanity of zarate-2 :P > > On Wed, Dec 7, 2011 at 12:37 PM, Gustavo Correa wrote: > Hi Ricardo > > You set only two nodes of type 'uno' and another two of type 'dos' > So, how come that in your PBS script you request four nodes of each type? 
> [You stripped off your PBS script from this email, it was in a previous message, > and it had the line "#PBS -l nodes=4" > > If I remember right your goal was to direct the jobs to the right type of node, correct? > However, you must have the resources that you are requesting. > If you ask for more than you have, you won't get it. > > My suggestion is. > Try one job with: > > #PBS -q uno > #PBS -l nodes=2:ppn=2 > ... > mpiexec -np=4 ... > > Try another job with: > > #PBS -q dos > #PBS -l nodes=2:ppn=2 > ... > mpiexec -np=4 ... > > If you also want to run jobs on four nodes, create a third queue, say 'unoedos', > and do *not* set the resources_default.neednodes on that one, so that any type > of node will go on that one. > Then try a job: > > #PBS -q unoedos > #PBS -l nodes=4:ppn=2 > ... > mpiexec -np=8 ... > > On which nodes do you expect each of the three jobs above to run on? > Does it make sense for you? > Well, save the case that zarate-2 has a problem, of course, > perhaps misinstalled pbs_mom or something else. > > I hope this helps ... and that it works ... 
:) > > Good luck, > Gus Correa > > On Dec 7, 2011, at 1:01 PM, Ricardo Rom?n Brenes wrote: > > > For gus =) : > > There is no error message it just stays in a status Q > > > > [root at zarate-0 ~]# cat /var/spool/pbs/server_priv/nodes > > zarate-0 np=2 uno > > zarate-1 np=2 uno > > zarate-2 np=2 dos > > zarate-3 np=2 dos > > > > > > [root at zarate-0 ~]# pbsnodes > > zarate-0 > > state = free > > np = 2 > > properties = uno > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-0 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=1601,nsessions=1,nusers=1,idletime=149,totmem=736568kb,availmem=628092kb,physmem=212288kb,ncpus=2,loadave=0.02,gres=,netload=51327373,state=free,jobs=,varattr=,rectime=1323279946 > > > > zarate-1 > > state = free > > np = 2 > > properties = uno > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-1 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=85536,totmem=730040kb,availmem=650168kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=17064175,state=free,jobs=,varattr=,rectime=1323279949 > > > > zarate-2 > > state = free > > np = 2 > > properties = dos > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-2 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=85527,totmem=737208kb,availmem=659320kb,physmem=212288kb,ncpus=2,loadave=0.04,gres=,netload=26043036,state=free,jobs=,varattr=,rectime=1323279953 > > > > zarate-3 > > state = free > > np = 2 > > properties = dos > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-3 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 
0,nusers=0,idletime=85538,totmem=737208kb,availmem=657236kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=58004536,state=free,jobs=,varattr=,rectime=1323279969 > > > > > > > > For Steve: > > > > this is hte checkjob: > > [rroman at zarate-0:~/outputs]$ checkjob 104 > > > > > > checking job 104 > > > > State: Running > > Creds: user:rroman group:usuariosCluster class:dos qos:DEFAULT > > WallTime: 00:00:00 of 1:00:00 > > SubmitTime: Wed Dec 7 11:46:03 > > (Time Queued Total: 00:18:11 Eligible: 00:18:11) > > > > StartTime: Wed Dec 7 12:04:14 > > Total Tasks: 4 > > > > Req[0] TaskCount: 4 Partition: DEFAULT > > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 > > Opsys: [NONE] Arch: [NONE] Features: [dos] > > Allocated Nodes: > > [zarate-3:2][zarate-2:2] > > > > > > IWD: [NONE] Executable: [NONE] > > Bypass: 0 StartCount: 1090 > > PartitionMask: [ALL] > > Flags: RESTARTABLE > > > > Reservation '104' (00:00:00 -> 1:00:00 Duration: 1:00:00) > > PE: 4.00 StartPriority: 18 > > > > it says its RUNNING! but on the qstat the shows to be QUEUE... 
> > > > [rroman at zarate-0:~/outputs]$ qstat > > Job id Name User Time Use S Queue > > ------------------------- ---------------- --------------- -------- - ----- > > 104.zarate-0 a.t rroman 0 Q dos > > > > Also i discovered with qstat -fl that: > > [rroman at zarate-0:~/outputs]$ qstat -fl > > Job Id: 104.zarate-0 > > Job_Name = a.t > > Job_Owner = rroman at zarate-0 > > job_state = Q > > queue = dos > > server = zarate-0 > > Checkpoint = u > > ctime = Wed Dec 7 11:46:03 2011 > > Error_Path = zarate-0:/home/rroman/outputs/a.t.e104 > > Hold_Types = n > > Join_Path = n > > Keep_Files = n > > Mail_Points = a > > mtime = Wed Dec 7 12:07:02 2011 > > Output_Path = zarate-0:/home/rroman/outputs/a.t.o104 > > Priority = 0 > > qtime = Wed Dec 7 11:46:03 2011 > > Rerunable = True > > Resource_List.nodect = 4 > > Resource_List.nodes = 4 > > Resource_List.walltime = 01:00:00 > > Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, > > PBS_O_LOGNAME=rroman, > > PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ > > usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local > > /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, > > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, > > PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, > > PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos > > etime = Wed Dec 7 11:46:03 2011 > > exit_status = -3 > > submit_args = ../a.t > > start_time = Wed Dec 7 12:07:02 2011 > > Walltime.Remaining = 360 > > start_count = 1258 > > fault_tolerant = False > > > > > > The job has an exit status of -3 (which is "JOB_EXEC_RETRY -3 job execution failed, do retry) but i dont have a clue why that happened and why Maui and Torque report different status (running, queued and exit with -3). > > > > > > > > While performing some test i realized that the problem is in zarate-2. If i run the same job with 1 or 2 nodes (assiged to zarate-3) it works. 
But when zarate-2 has to be assigned jobs i get this behavior.
> >

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/014332c2/attachment-0001.html

From gus at ldeo.columbia.edu Wed Dec 7 13:43:47 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 7 Dec 2011 15:43:47 -0500
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu>
 <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu>
Message-ID:

Hi Ricardo

I guess you are mixing up the notion of node with the notion of processor.
For instance, in a hypothetical PBS directive

#PBS -l nodes=3:ppn=2

one is asking for 3 nodes [i.e. three computers] with 2 CPUs [or cores] each.

I would rather stick to the full syntax like above, using both 'nodes' *and* 'ppn',
to avoid confusion, and to prevent Torque from using defaults or making decisions for you.

Of course, somebody will quickly object that the notion of 'node' or single computer
is getting blurred with new technologies, etc, but I guess each of your 'zarate-?'
computers can still be characterized as a separate computer, in the sense that they
have a separate OS copy running on each of them, separate motherboard, memory, etc, etc.
I.e. they are four PCs, connected to each other via a small switch making a network, right?
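
As an aside, here is a minimal sketch (not from the original thread) of one way to keep the mpiexec process count matched to what Torque actually allocated. Inside a job Torque sets $PBS_NODEFILE to a file with one line per allocated slot, so counting its lines gives nodes x ppn. The fallback sample file below is an assumption for a dry run outside of a job; it mimics nodes=2:ppn=2 with host names borrowed from this thread.

```sh
#!/bin/sh
# Sketch: derive the MPI process count from the Torque allocation.
# Outside of a job PBS_NODEFILE is unset, so fabricate a sample file
# (hypothetical contents, mimicking "#PBS -l nodes=2:ppn=2").
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf '%s\n' zarate-2 zarate-2 zarate-3 zarate-3 > "$PBS_NODEFILE"
fi

# One line per allocated slot: total lines = nodes * ppn.
NP=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
# Distinct host names = number of nodes.
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')

echo "allocated $NP slots on $NNODES nodes"
# Inside a real job script one would then launch with:
#   mpiexec -np "$NP" ./a.out
```

Computing the count this way, rather than hard-coding it, means the script cannot ask mpiexec for more processes than the scheduler granted.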

Also, note that in the $TORQUE/server_priv/nodes file the line:

zarate-0 np=2 uno

means that zarate-0 is one node, with the 'property' 'uno', which has two CPUs/cores [np=2].
It is not necessarily the number of CPU sockets on the motherboard that counts.
If you have multicore processors, it is the total number of cores on zarate-0 that should be used.
My recollection is that thousands of emails ago you said you had multicore machines.
It gets a bit more confusing if you have hyperthreaded Intel processors,
because they appear as two hyperthreaded cores for each physical core.
Is it really np=2 in your case?

Likewise, in the PBS directive

#PBS -l nodes=3:ppn=2

one is asking for three nodes with two cores [at least] each.

Good luck with uno, dos, and unoedos. :)

I hope this helps,
Gus Correa

On Dec 7, 2011, at 3:21 PM, Ricardo Román Brenes wrote:

> Hi, thanks for replying so fast =)
>
> In the script when i ask for 2 nodes, those 2 nodes are on the same machine. Don't ask me why, but i think it's because the nodes file has np=2; when i ask for 3 nodes i get 2 times zarate-0 and 1 time zarate-1. If i ask for 4 nodes i get 2 times each node.
>
> And thanks for pointing out that joined queue! =) i'm going to check the sanity of zarate-2 :P
>
> On Wed, Dec 7, 2011 at 12:37 PM, Gustavo Correa wrote:
> Hi Ricardo
>
> You set only two nodes of type 'uno' and another two of type 'dos'
> So, how come that in your PBS script you request four nodes of each type?
> [You stripped off your PBS script from this email, it was in a previous message,
> and it had the line "#PBS -l nodes=4"]
>
> If I remember right your goal was to direct the jobs to the right type of node, correct?
> However, you must have the resources that you are requesting.
> If you ask for more than you have, you won't get it.
>
> My suggestion is.
> Try one job with:
>
> #PBS -q uno
> #PBS -l nodes=2:ppn=2
> ...
> mpiexec -np=4 ...
>
> Try another job with:
>
> #PBS -q dos
> #PBS -l nodes=2:ppn=2
> ...
> mpiexec -np=4 ... > > If you also want to run jobs on four nodes, create a third queue, say 'unoedos', > and do *not* set the resources_default.neednodes on that one, so that any type > of node will go on that one. > Then try a job: > > #PBS -q unoedos > #PBS -l nodes=4:ppn=2 > ... > mpiexec -np=8 ... > > On which nodes do you expect each of the three jobs above to run on? > Does it make sense for you? > Well, save the case that zarate-2 has a problem, of course, > perhaps misinstalled pbs_mom or something else. > > I hope this helps ... and that it works ... :) > > Good luck, > Gus Correa > > On Dec 7, 2011, at 1:01 PM, Ricardo Rom?n Brenes wrote: > > > For gus =) : > > There is no error message it just stays in a status Q > > > > [root at zarate-0 ~]# cat /var/spool/pbs/server_priv/nodes > > zarate-0 np=2 uno > > zarate-1 np=2 uno > > zarate-2 np=2 dos > > zarate-3 np=2 dos > > > > > > [root at zarate-0 ~]# pbsnodes > > zarate-0 > > state = free > > np = 2 > > properties = uno > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-0 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=1601,nsessions=1,nusers=1,idletime=149,totmem=736568kb,availmem=628092kb,physmem=212288kb,ncpus=2,loadave=0.02,gres=,netload=51327373,state=free,jobs=,varattr=,rectime=1323279946 > > > > zarate-1 > > state = free > > np = 2 > > properties = uno > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-1 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=85536,totmem=730040kb,availmem=650168kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=17064175,state=free,jobs=,varattr=,rectime=1323279949 > > > > zarate-2 > > state = free > > np = 2 > > properties = dos > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-2 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 
0,nusers=0,idletime=85527,totmem=737208kb,availmem=659320kb,physmem=212288kb,ncpus=2,loadave=0.04,gres=,netload=26043036,state=free,jobs=,varattr=,rectime=1323279953 > > > > zarate-3 > > state = free > > np = 2 > > properties = dos > > ntype = cluster > > status = opsys=linux,uname=Linux zarate-3 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=85538,totmem=737208kb,availmem=657236kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=58004536,state=free,jobs=,varattr=,rectime=1323279969 > > > > > > > > For Steve: > > > > this is hte checkjob: > > [rroman at zarate-0:~/outputs]$ checkjob 104 > > > > > > checking job 104 > > > > State: Running > > Creds: user:rroman group:usuariosCluster class:dos qos:DEFAULT > > WallTime: 00:00:00 of 1:00:00 > > SubmitTime: Wed Dec 7 11:46:03 > > (Time Queued Total: 00:18:11 Eligible: 00:18:11) > > > > StartTime: Wed Dec 7 12:04:14 > > Total Tasks: 4 > > > > Req[0] TaskCount: 4 Partition: DEFAULT > > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 > > Opsys: [NONE] Arch: [NONE] Features: [dos] > > Allocated Nodes: > > [zarate-3:2][zarate-2:2] > > > > > > IWD: [NONE] Executable: [NONE] > > Bypass: 0 StartCount: 1090 > > PartitionMask: [ALL] > > Flags: RESTARTABLE > > > > Reservation '104' (00:00:00 -> 1:00:00 Duration: 1:00:00) > > PE: 4.00 StartPriority: 18 > > > > it says its RUNNING! but on the qstat the shows to be QUEUE... 
> > > > [rroman at zarate-0:~/outputs]$ qstat > > Job id Name User Time Use S Queue > > ------------------------- ---------------- --------------- -------- - ----- > > 104.zarate-0 a.t rroman 0 Q dos > > > > Also i discovered with qstat -fl that: > > [rroman at zarate-0:~/outputs]$ qstat -fl > > Job Id: 104.zarate-0 > > Job_Name = a.t > > Job_Owner = rroman at zarate-0 > > job_state = Q > > queue = dos > > server = zarate-0 > > Checkpoint = u > > ctime = Wed Dec 7 11:46:03 2011 > > Error_Path = zarate-0:/home/rroman/outputs/a.t.e104 > > Hold_Types = n > > Join_Path = n > > Keep_Files = n > > Mail_Points = a > > mtime = Wed Dec 7 12:07:02 2011 > > Output_Path = zarate-0:/home/rroman/outputs/a.t.o104 > > Priority = 0 > > qtime = Wed Dec 7 11:46:03 2011 > > Rerunable = True > > Resource_List.nodect = 4 > > Resource_List.nodes = 4 > > Resource_List.walltime = 01:00:00 > > Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, > > PBS_O_LOGNAME=rroman, > > PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ > > usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local > > /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, > > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, > > PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, > > PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos > > etime = Wed Dec 7 11:46:03 2011 > > exit_status = -3 > > submit_args = ../a.t > > start_time = Wed Dec 7 12:07:02 2011 > > Walltime.Remaining = 360 > > start_count = 1258 > > fault_tolerant = False > > > > > > The job has an exit status of -3 (which is "JOB_EXEC_RETRY -3 job execution failed, do retry) but i dont have a clue why that happened and why Maui and Torque report different status (running, queued and exit with -3). > > > > > > > > While performing some test i realized that the problem is in zarate-2. If i run the same job with 1 or 2 nodes (assiged to zarate-3) it works. 
But when zarate-2 has to be assigned jobs i get this behavior.
> >

From roman.ricardo at gmail.com Wed Dec 7 14:01:25 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Wed, 7 Dec 2011 15:01:25 -0600
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu>
 <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu>
Message-ID:

I'm very clear on the notions, but my mother language is Spanish and I might be mixing things up there :P

The zarate machines are PS3s. They have a double PPC processor and several other processors, but for now I'm just interested in using the PPC. That's why each zarate-X node has np=2.

The installation on zarate-2 seems fine.
I made a queue with just that node and the jobs run fine, but somehow when I try to use zarate-2 together with another zarate node the job stays queued:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
107.zarate-0              a.t              rroman                 0 Q dos

doing a tracejob:

/var/spool/pbs/mom_logs/20111207: No matching job records located
/var/spool/pbs/sched_logs/20111207: No such file or directory

Job: 108.zarate-0

12/07/2011 15:08:12 S enqueuing into dos, state 1 hop 1
12/07/2011 15:08:12 S Job Queued at request of rroman at zarate-0, owner = rroman at zarate-0, job name = a.t, queue = dos
12/07/2011 15:08:12 A queue=dos
12/07/2011 15:08:13 S Job Run at request of root at zarate-0
12/07/2011 15:08:13 A user=rroman group=usuariosCluster jobname=a.t queue=dos ctime=1323292092 qtime=1323292092 etime=1323292092 start=1323292093 owner=rroman$
12/07/2011 15:08:14 A user=rroman group=usuariosCluster jobname=a.t queue=dos ctime=1323292092 qtime=1323292092 etime=1323292092 start=1323292094 owner=rroman$

and that last line repeats itself until the current time, which is 15:10 here on my computer :X

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/9e08268d/attachment.html

From roman.ricardo at gmail.com Wed Dec 7 14:16:00 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Wed, 7 Dec 2011 15:16:00 -0600
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu>
 <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu>
Message-ID:

and....
[rroman at zarate-0:~/outputs]$ qstat -fl Job Id: 108.zarate-0 Job_Name = a.t Job_Owner = rroman at zarate-0 job_state = Q queue = dos server = zarate-0 Checkpoint = u ctime = Wed Dec 7 15:08:12 2011 Error_Path = zarate-0:/home/rroman/outputs/a.t.e108 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = a mtime = Wed Dec 7 15:23:30 2011 Output_Path = zarate-0:/home/rroman/outputs/a.t.o108 Priority = 0 qtime = Wed Dec 7 15:08:12 2011 Rerunable = True Resource_List.nodect = 4 Resource_List.nodes = 4 Resource_List.walltime = 01:00:00 Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, PBS_O_LOGNAME=rroman, PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos etime = Wed Dec 7 15:08:12 2011 exit_status = -3 submit_args = ../a.t -q dos -l nodes=4 start_time = Wed Dec 7 15:23:30 2011 Walltime.Remaining = 359 start_count = 918 fault_tolerant = False keeps saying exit_status -3 ! not sure what's happening here... -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/e794e5ad/attachment.html

From gus at ldeo.columbia.edu Wed Dec 7 14:28:30 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 7 Dec 2011 16:28:30 -0500
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu>
 <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu>
Message-ID: <425AD1A7-0837-4D0B-AC2C-B2F7D77CBCD5@ldeo.columbia.edu>

Hi Ricardo

On Dec 7, 2011, at 4:01 PM, Ricardo Román Brenes wrote:

> I'm very clear on the notions, but my mother language is Spanish and I might be mixing things up there :P
>

Oh well, no intent to belittle your command of English, which is actually outstanding.
BTW, my first language isn't English either.
After all those emails back and forth I didn't think this would be an issue.

The fact of the matter is that the language you might have tripped on, and which I and
most people at one time or another tripped on, is the Torque jargon itself.
That is what I was referring to when I mentioned the possible confusion between nodes
and ppn.
Actually, recently in the list somebody was pointing out that the number of processors
in the nodes file is np=4, whereas in the PBS directive it is ppn=4.
Funny?

> The zarate machines are PS3s. They have a double PPC processor and several other processors, but for now I'm just interested in using the PPC. That's why each zarate-X node has np=2.
>

And presumably each PPC processor is single core, right?

> The installation on zarate-2 seems fine. I made a queue with just that node and the jobs run fine, but somehow when I try to use zarate-2 together with another zarate node the job stays queued:
>
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 107.zarate-0              a.t              rroman                 0 Q dos
>
>

Is it possible that zarate-2 is not running pbs_mom?
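
As an aside, a rough sketch (not from the original thread) of a quick way to see what state pbs_server has recorded for each node, by filtering `pbsnodes -a`-style text. The helper name node_states and the sample text are hypothetical; in particular the "down" state below is invented to show what an unreachable pbs_mom would typically look like, whereas the pbsnodes output posted earlier in this thread listed all four nodes as free.

```sh
#!/bin/sh
# node_states: read pbsnodes-style text on stdin, print "<node> <state>".
node_states() {
    awk '
        /^[^ \t]/ { node = $1 }                        # unindented line: node name
        $1 == "state" && $2 == "=" { print node, $3 }  # indented "state = ..." line
    '
}

# Invented sample in the layout shown in the thread (NOT real output).
sample='zarate-2
     state = down
     np = 2
     properties = dos

zarate-3
     state = free
     np = 2
     properties = dos'

states=$(printf '%s\n' "$sample" | node_states)
echo "$states"
# On the server one would instead run:  pbsnodes -a | node_states
```

A node reported as free only tells you the server heard from its pbs_mom recently, so a free node that still fails jobs (as zarate-2 seems to here) is better investigated in that node's own mom_logs.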

> doing a tracejob:
> /var/spool/pbs/mom_logs/20111207: No matching job records located
> /var/spool/pbs/sched_logs/20111207: No such file or directory
>
> Job: 108.zarate-0
>
> 12/07/2011 15:08:12 S enqueuing into dos, state 1 hop 1
> 12/07/2011 15:08:12 S Job Queued at request of rroman at zarate-0, owner = rroman at zarate-0, job name = a.t, queue = dos
> 12/07/2011 15:08:12 A queue=dos
> 12/07/2011 15:08:13 S Job Run at request of root at zarate-0
> 12/07/2011 15:08:13 A user=rroman group=usuariosCluster jobname=a.t queue=dos ctime=1323292092 qtime=1323292092 etime=1323292092 start=1323292093 owner=rroman$
> 12/07/2011 15:08:14 A user=rroman group=usuariosCluster jobname=a.t queue=dos ctime=1323292092 qtime=1323292092 etime=1323292092 start=1323292094 owner=rroman$
>
> and that last line repeats itself until the current time, which is 15:10 here on my computer :X
>

Do all the other nodes work fine? [zarate-0, zarate-1, zarate-3]?

Have you tried to force the job to run with 'qrun job_number' [as root]?

On a different note, I don't know what you plan to do in your cluster, but note that
if you also have Xeon machines [I think you said that long ago] that may become
part of the cluster later, you may have some difficulties when trying to run MPI parallel
jobs across the different architectures [PPC and Intel, for instance].
Anyway, maybe this is exactly why you are trying to make separate queues for different types of
nodes.

I hope this helps, Gus Correa From gus at ldeo.columbia.edu Wed Dec 7 14:37:46 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 7 Dec 2011 16:37:46 -0500 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> Message-ID: <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> Hi Ricardo These two lines: > Resource_List.nodect = 4 > Resource_List.nodes = 4 suggest that you continue to ask for 4 nodes, like this: #PBS nodes=4 on queue 'dos'. Or am I mistaken? Have you tried this? #PBS -q dos #PBS -l nodes=2:ppn=2 As per a previous email, your pbs_server/nodes file lists only two nodes of type 'dos', not four. If so, you don't have the nodes that requested on the job script, which may explain why the job doesn't run. My two cents, Gus Correa On Dec 7, 2011, at 4:16 PM, Ricardo Rom?n Brenes wrote: > and.... > > [rroman at zarate-0:~/outputs]$ qstat -fl > Job Id: 108.zarate-0 > Job_Name = a.t > Job_Owner = rroman at zarate-0 > job_state = Q > queue = dos > server = zarate-0 > Checkpoint = u > ctime = Wed Dec 7 15:08:12 2011 > Error_Path = zarate-0:/home/rroman/outputs/a.t.e108 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = a > mtime = Wed Dec 7 15:23:30 2011 > Output_Path = zarate-0:/home/rroman/outputs/a.t.o108 > Priority = 0 > qtime = Wed Dec 7 15:08:12 2011 > Rerunable = True > Resource_List.nodect = 4 > Resource_List.nodes = 4 > Resource_List.walltime = 01:00:00 > Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, > PBS_O_LOGNAME=rroman, > PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ > usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local > /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, > PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, > 
PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos > etime = Wed Dec 7 15:08:12 2011 > exit_status = -3 > submit_args = ../a.t -q dos -l nodes=4 > start_time = Wed Dec 7 15:23:30 2011 > Walltime.Remaining = 359 > start_count = 918 > fault_tolerant = False > > keeps saying exit_status -3 ! not sure what's happening here... > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From roman.ricardo at gmail.com Wed Dec 7 14:37:46 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 7 Dec 2011 15:37:46 -0600 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: <425AD1A7-0837-4D0B-AC2C-B2F7D77CBCD5@ldeo.columbia.edu> References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <425AD1A7-0837-4D0B-AC2C-B2F7D77CBCD5@ldeo.columbia.edu> Message-ID: On Wed, Dec 7, 2011 at 3:28 PM, Gustavo Correa wrote: > Hi Ricardo > > On Dec 7, 2011, at 4:01 PM, Ricardo Rom?n Brenes wrote: > > > I'm very clear in the notions yet my mother language is spanish and i > might be doing some mix up there :P > > > Oh well, no intent to belittle your domain of English, which is actually > outstanding. > BTW, my first language isn't English either. > No offense taken =) > After all those emails back and forth I didn't think this would be an > issue. > > The fact of matter is that the language where you might have tripped on, > and which I and > most people at one time or anohter tripped on, is the Torque jargon itself. > That is what I was referring to when I mentioned the possible confusion > between nodes > and ppn. > Actually, recently in the list somebody was pointing out that number of > processors > in the nodes file it is np=4, wheeas in the PBS directive it is ppn=4. > Funny? > > hehe yeah i saw that message too :P > > > zarate machines are PS3. 
They got a double PPC processor and several > other processors but for now im just intrested in using the PPC. That's > why Each node zarate-X has np=2. > > > And presumably each PPC processor is single core, right? > > 1) And no i'm sorry the PPC processor is called *PPE. *And it's a two-way multithreaded core acting as the controller for the eight SPEs > > The installation on zarate-2 seems fine. I made a queue with just that > node and the jobs run fine but somehow when i try to use zarate-2 and > another zarate node the job stays queued: > > > > Job id Name User Time Use S > Queue > > ------------------------- ---------------- --------------- -------- - > ----- > > 107.zarate-0 a.t rroman 0 Q > dos > > > > > > Is it possible that zarate-2 is not running pbs_mom? > > 2) root 2564 0.2 4.5 10864 9644 ? SLs 11:50 0:41 pbs_mom > > doing a tracejob: > > /var/spool/pbs/mom_logs/20111207: No matching job records located > > /var/spool/pbs/sched_logs/20111207: No such file or directory > > > > Job: 108.zarate-0 > > > > 12/07/2011 15:08:12 S enqueuing into dos, state 1 hop 1 > > 12/07/2011 15:08:12 S Job Queued at request of rroman at zarate-0, > owner = rroman at zarate-0, job name = a.t, queue = dos > > 12/07/2011 15:08:12 A queue=dos > > 12/07/2011 15:08:13 S Job Run at request of root at zarate-0 > > 12/07/2011 15:08:13 A user=rroman group=usuariosCluster jobname=a.t > queue=dos ctime=1323292092 qtime=1323292092 etime=1323292092 > start=1323292093 owner=rroman$ > > 12/07/2011 15:08:14 A user=rroman group=usuariosCluster jobname=a.t > queue=dos ctime=1323292092 qtime=1323292092 etime=1323292092 > start=1323292094 owner=rroman$ > > > > and that last line repeats itself until the current time which is 15:10 > here in my computer :X > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > Do all the other nodes work fine? 
[zarate-0, zarate-1, zarate-3]? > 3) yes the ohter 3 nodes seems to work fine. > Have you tried to force the job to run with 'qrun job_number' [as root]? > 4) with what purpose? what does that do? > > On a different note, I don't know what you plan to do in your cluster, but > note that > if you have also Xeon machine [I think you said that long ago] that may > become > part of the cluster later, you may have some difficulties when trying to > run MPI parallel > jobs across the different architectures [PPC and Intel, for instance]. > Anyway, maybe this is exactly why you are trying to make separate queues > for different types of > nodes. > > Indeed thats what im tyring to do =) we'll have 5 xeon nodes wiith 3 of them with 2 teslas, a visualization machine and the PS3; i need the different queues so the login server can send jobs to each of the corresponding nodes. > I hope this helps, > Gus Correa > it always does! > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/67d53290/attachment.html From roman.ricardo at gmail.com Wed Dec 7 14:41:13 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 7 Dec 2011 15:41:13 -0600 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> Message-ID: yeah i have nodes=4, but if it works on queue uno why not on queue dos? 
On Wed, Dec 7, 2011 at 3:37 PM, Gustavo Correa wrote: > Hi Ricardo > > These two lines: > > > Resource_List.nodect = 4 > > Resource_List.nodes = 4 > > suggest that you continue to ask for 4 nodes, like this: > > #PBS nodes=4 > > on queue 'dos'. > Or am I mistaken? > > Have you tried this? > > #PBS -q dos > #PBS -l nodes=2:ppn=2 > > As per a previous email, your pbs_server/nodes file lists only two nodes > of type 'dos', not four. > If so, you don't have the nodes that requested on the job script, > which may explain why the job doesn't run. > > My two cents, > Gus Correa > > > On Dec 7, 2011, at 4:16 PM, Ricardo Rom?n Brenes wrote: > > > and.... > > > > [rroman at zarate-0:~/outputs]$ qstat -fl > > Job Id: 108.zarate-0 > > Job_Name = a.t > > Job_Owner = rroman at zarate-0 > > job_state = Q > > queue = dos > > server = zarate-0 > > Checkpoint = u > > ctime = Wed Dec 7 15:08:12 2011 > > Error_Path = zarate-0:/home/rroman/outputs/a.t.e108 > > Hold_Types = n > > Join_Path = n > > Keep_Files = n > > Mail_Points = a > > mtime = Wed Dec 7 15:23:30 2011 > > Output_Path = zarate-0:/home/rroman/outputs/a.t.o108 > > Priority = 0 > > qtime = Wed Dec 7 15:08:12 2011 > > Rerunable = True > > Resource_List.nodect = 4 > > Resource_List.nodes = 4 > > Resource_List.walltime = 01:00:00 > > Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, > > PBS_O_LOGNAME=rroman, > > > PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ > > > usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local > > /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, > > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, > > > PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, > > PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos > > etime = Wed Dec 7 15:08:12 2011 > > exit_status = -3 > > submit_args = ../a.t -q dos -l nodes=4 > > start_time = Wed Dec 7 15:23:30 2011 > > Walltime.Remaining = 359 > > start_count = 918 > > fault_tolerant = False > 
> > > keeps saying exit_status -3 ! not sure what's happening here... > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/d36334f7/attachment-0001.html From roman.ricardo at gmail.com Wed Dec 7 14:41:55 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 7 Dec 2011 15:41:55 -0600 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> Message-ID: [rroman at zarate-0:~/outputs]$ qsub ../a.t -q uno -l nodes=4 109.zarate-0 [rroman at zarate-0:~/outputs]$ ls a.t.e109 a.t.o109 [rroman at zarate-0:~/outputs]$ cat a.t.o109 Nodes Assigned: zarate-1 zarate-1 zarate-0 zarate-0 running... -l nodes=2 && -n 1 zarate-1: hello world from process 0 of 1 running... -l nodes=2 && -n 2 zarate-1: hello world from process 1 of 2 zarate-1: hello world from process 0 of 2 running... -l nodes=2 && -n 4 zarate-1: hello world from process 1 of 4 zarate-1: hello world from process 0 of 4 zarate-0: hello world from process 2 of 4 zarate-0: hello world from process 3 of 4 running... 
-l nodes=2 && -n 8 zarate-1: hello world from process 0 of 8 zarate-1: hello world from process 4 of 8 zarate-0: hello world from process 7 of 8 zarate-1: hello world from process 1 of 8 zarate-0: hello world from process 2 of 8 zarate-1: hello world from process 5 of 8 zarate-0: hello world from process 6 of 8 zarate-0: hello world from process 3 of 8 done On Wed, Dec 7, 2011 at 3:37 PM, Gustavo Correa wrote: > Hi Ricardo > > These two lines: > > > Resource_List.nodect = 4 > > Resource_List.nodes = 4 > > suggest that you continue to ask for 4 nodes, like this: > > #PBS nodes=4 > > on queue 'dos'. > Or am I mistaken? > > Have you tried this? > > #PBS -q dos > #PBS -l nodes=2:ppn=2 > > As per a previous email, your pbs_server/nodes file lists only two nodes > of type 'dos', not four. > If so, you don't have the nodes that requested on the job script, > which may explain why the job doesn't run. > > My two cents, > Gus Correa > > > On Dec 7, 2011, at 4:16 PM, Ricardo Rom?n Brenes wrote: > > > and.... 
> > > > [rroman at zarate-0:~/outputs]$ qstat -fl > > Job Id: 108.zarate-0 > > Job_Name = a.t > > Job_Owner = rroman at zarate-0 > > job_state = Q > > queue = dos > > server = zarate-0 > > Checkpoint = u > > ctime = Wed Dec 7 15:08:12 2011 > > Error_Path = zarate-0:/home/rroman/outputs/a.t.e108 > > Hold_Types = n > > Join_Path = n > > Keep_Files = n > > Mail_Points = a > > mtime = Wed Dec 7 15:23:30 2011 > > Output_Path = zarate-0:/home/rroman/outputs/a.t.o108 > > Priority = 0 > > qtime = Wed Dec 7 15:08:12 2011 > > Rerunable = True > > Resource_List.nodect = 4 > > Resource_List.nodes = 4 > > Resource_List.walltime = 01:00:00 > > Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, > > PBS_O_LOGNAME=rroman, > > > PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ > > > usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local > > /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, > > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, > > > PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, > > PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos > > etime = Wed Dec 7 15:08:12 2011 > > exit_status = -3 > > submit_args = ../a.t -q dos -l nodes=4 > > start_time = Wed Dec 7 15:23:30 2011 > > Walltime.Remaining = 359 > > start_count = 918 > > fault_tolerant = False > > > > keeps saying exit_status -3 ! not sure what's happening here... > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/f6469e85/attachment.html From gus at ldeo.columbia.edu Wed Dec 7 14:50:15 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 7 Dec 2011 16:50:15 -0500 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> Message-ID: <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> Hi Ricardo On Dec 7, 2011, at 4:41 PM, Ricardo Rom?n Brenes wrote: > yeah i have nodes=4, but if it works on queue uno why not on queue dos? > I don't know the answer. it may have to do with defaults and or decisions that Torque may make for you when you are not specific about the ppn in your PBS directives, but I don't really know. Maybe the Torque developers have a hint. However, you didn't answer my previous question either: > Have you tried this? > > #PBS -q dos > #PBS -l nodes=2:ppn=2 Did you try it? Gus Correa > On Wed, Dec 7, 2011 at 3:37 PM, Gustavo Correa wrote: > Hi Ricardo > > These two lines: > > > Resource_List.nodect = 4 > > Resource_List.nodes = 4 > > suggest that you continue to ask for 4 nodes, like this: > > #PBS nodes=4 > > on queue 'dos'. > Or am I mistaken? > > Have you tried this? > > #PBS -q dos > #PBS -l nodes=2:ppn=2 > > As per a previous email, your pbs_server/nodes file lists only two nodes of type 'dos', not four. > If so, you don't have the nodes that requested on the job script, > which may explain why the job doesn't run. > > My two cents, > Gus Correa > > > On Dec 7, 2011, at 4:16 PM, Ricardo Rom?n Brenes wrote: > > > and.... 
> > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From roman.ricardo at gmail.com Wed Dec 7 14:52:08 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 7 Dec 2011 15:52:08 -0600 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> Message-ID: oops i missed that part! but it doenst work. It works on queue uno but not on queue dos (its the same behavoir as before with nodes=4). i thinking in reinstall the pbs_mom (aka recompiling...) On Wed, Dec 7, 2011 at 3:50 PM, Gustavo Correa wrote: > Hi Ricardo > On Dec 7, 2011, at 4:41 PM, Ricardo Rom?n Brenes wrote: > > > yeah i have nodes=4, but if it works on queue uno why not on queue dos? > > > I don't know the answer. > it may have to do with defaults and or decisions that Torque may make for > you > when you are not specific about the ppn in your PBS directives, but I > don't really know. > Maybe the Torque developers have a hint. > > However, you didn't answer my previous question either: > > > Have you tried this? > > > > #PBS -q dos > > #PBS -l nodes=2:ppn=2 > > Did you try it? 
> > Gus Correa > > > > On Wed, Dec 7, 2011 at 3:37 PM, Gustavo Correa > wrote: > > Hi Ricardo > > > > These two lines: > > > > > Resource_List.nodect = 4 > > > Resource_List.nodes = 4 > > > > suggest that you continue to ask for 4 nodes, like this: > > > > #PBS nodes=4 > > > > on queue 'dos'. > > Or am I mistaken? > > > > Have you tried this? > > > > #PBS -q dos > > #PBS -l nodes=2:ppn=2 > > > > As per a previous email, your pbs_server/nodes file lists only two nodes > of type 'dos', not four. > > If so, you don't have the nodes that requested on the job script, > > which may explain why the job doesn't run. > > > > My two cents, > > Gus Correa > > > > > > On Dec 7, 2011, at 4:16 PM, Ricardo Rom?n Brenes wrote: > > > > > and.... > > > > > > [rroman at zarate-0:~/outputs]$ qstat -fl > > > Job Id: 108.zarate-0 > > > Job_Name = a.t > > > Job_Owner = rroman at zarate-0 > > > job_state = Q > > > queue = dos > > > server = zarate-0 > > > Checkpoint = u > > > ctime = Wed Dec 7 15:08:12 2011 > > > Error_Path = zarate-0:/home/rroman/outputs/a.t.e108 > > > Hold_Types = n > > > Join_Path = n > > > Keep_Files = n > > > Mail_Points = a > > > mtime = Wed Dec 7 15:23:30 2011 > > > Output_Path = zarate-0:/home/rroman/outputs/a.t.o108 > > > Priority = 0 > > > qtime = Wed Dec 7 15:08:12 2011 > > > Rerunable = True > > > Resource_List.nodect = 4 > > > Resource_List.nodes = 4 > > > Resource_List.walltime = 01:00:00 > > > Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, > > > PBS_O_LOGNAME=rroman, > > > > PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ > > > > usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local > > > /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, > > > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, > > > > PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, > > > PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos > > > etime = Wed Dec 7 15:08:12 2011 > > > exit_status = -3 > > > 
submit_args = ../a.t -q dos -l nodes=4 > > > start_time = Wed Dec 7 15:23:30 2011 > > > Walltime.Remaining = 359 > > > start_count = 918 > > > fault_tolerant = False > > > > > > keeps saying exit_status -3 ! not sure what's happening here... > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/e645f845/attachment-0001.html From gus at ldeo.columbia.edu Wed Dec 7 15:02:03 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 7 Dec 2011 17:02:03 -0500 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> Message-ID: <58ACD013-F96D-4F16-A7C7-8CC703768AC4@ldeo.columbia.edu> What if you switch the nodes assigned to queues uno and dos? I.e., change your nodes file from zarate-0 np=2 dos zarate-1 np=2 dos zarate-2 np=2 uno zarate-3 np=2 uno to zarate-0 np=2 uno zarate-1 np=2 uno zarate-2 np=2 dos zarate-3 np=2 dos restart pbs_server, then submit your test jobs again. 
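A minimal sketch of that swap, for trying it on a scratch copy before touching the live file. The helper name swap_props and the scratch file are made up for illustration and are not part of Torque. It assumes every node line ends with exactly one property, uno or dos, as in the listing above. Restarting pbs_server afterwards is still a manual step.

```shell
#!/bin/sh
# Hypothetical helper (not a Torque command): print a nodes file with the
# 'uno' and 'dos' properties exchanged. Assumes the property is the last
# field on each line, as in the listing above.
swap_props() {
    sed -e 's/ uno$/ @TMP@/' \
        -e 's/ dos$/ uno/' \
        -e 's/ @TMP@$/ dos/' "$1"
}

# Work on a scratch copy, never directly on the live server_priv/nodes file.
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
zarate-0 np=2 dos
zarate-1 np=2 dos
zarate-2 np=2 uno
zarate-3 np=2 uno
EOF

# Print the swapped listing so it can be reviewed before it is installed.
swap_props "$scratch"
rm -f "$scratch"
```

Going through a temporary marker keeps a line from being rewritten twice, so a node that starts as uno ends as dos and vice versa.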
In any case, I would stay away from the syntax nodes=4. I don't know how Torque parses it, and I think it may be working against your very goal. I would stick to the style nodes=2:ppn=2.

Gus Correa

On Dec 7, 2011, at 4:52 PM, Ricardo Román Brenes wrote:

> oops i missed that part! but it doenst work. It works on queue uno but not on queue dos (its the same behavoir as before with nodes=4).
>
> i thinking in reinstall the pbs_mom (aka recompiling...)
>
> On Wed, Dec 7, 2011 at 3:50 PM, Gustavo Correa wrote:
> Hi Ricardo
> On Dec 7, 2011, at 4:41 PM, Ricardo Román Brenes wrote:
>
> > yeah i have nodes=4, but if it works on queue uno why not on queue dos?
> >
> I don't know the answer.
> it may have to do with defaults and or decisions that Torque may make for you
> when you are not specific about the ppn in your PBS directives, but I don't really know.
> Maybe the Torque developers have a hint.
>
> However, you didn't answer my previous question either:
>
> > Have you tried this?
> >
> > #PBS -q dos
> > #PBS -l nodes=2:ppn=2
>
> Did you try it?
>
> Gus Correa
>
>
> > On Wed, Dec 7, 2011 at 3:37 PM, Gustavo Correa wrote:
> > Hi Ricardo
> >
> > These two lines:
> >
> > > Resource_List.nodect = 4
> > > Resource_List.nodes = 4
> >
> > suggest that you continue to ask for 4 nodes, like this:
> >
> > #PBS nodes=4
> >
> > on queue 'dos'.
> > Or am I mistaken?
> >
> > Have you tried this?
> >
> > #PBS -q dos
> > #PBS -l nodes=2:ppn=2
> >
> > As per a previous email, your pbs_server/nodes file lists only two nodes of type 'dos', not four.
> > If so, you don't have the nodes that requested on the job script,
> > which may explain why the job doesn't run.
> >
> > My two cents,
> > Gus Correa
> >
> >
> > On Dec 7, 2011, at 4:16 PM, Ricardo Román Brenes wrote:
> >
> > > and....
> > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From roman.ricardo at gmail.com Wed Dec 7 15:21:24 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 7 Dec 2011 16:21:24 -0600 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: <58ACD013-F96D-4F16-A7C7-8CC703768AC4@ldeo.columbia.edu> References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> <58ACD013-F96D-4F16-A7C7-8CC703768AC4@ldeo.columbia.edu> Message-ID: ok so, switched: [root at zarate-0 ~]# pbsnodes zarate-0 state = free np = 2 properties = dos ntype = cluster status = opsys=linux,uname=Linux zarate-0 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=1601,nsessions=1,nusers=1,idletime=2695,totmem=736568kb,availmem=627116kb,physmem=212288kb,ncpus=2,loadave=0.14,gres=,netload=223300804,state=free,jobs=,varattr=,rectime=1323296933 zarate-1 state = free np = 2 properties = dos ntype = cluster status = 
opsys=linux,uname=Linux zarate-1 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=102521,totmem=730040kb,availmem=648564kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=18457772,state=free,jobs=,varattr=,rectime=1323296935 zarate-2 state = free np = 2 properties = uno ntype = cluster status = opsys=linux,uname=Linux zarate-2 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=102506,totmem=737208kb,availmem=661780kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=82050098,state=free,jobs=,varattr=,rectime=1323296935 zarate-3 state = free np = 2 properties = uno ntype = cluster status = opsys=linux,uname=Linux zarate-3 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=102505,totmem=737208kb,availmem=662512kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=289492564,state=free,jobs=,varattr=,rectime=1323296935 and... [rroman at zarate-0:~/outputs]$ qsub -l nodes=2:ppn=2 -q uno ../a.t 115.zarate-0 [rroman at zarate-0:~/outputs]$ qsub -l nodes=2:ppn=2 -q dos ../a.t 116.zarate-0 [rroman at zarate-0:~/outputs]$ qstat Job id Name User Time Use S Queue ------------------------- ---------------- --------------- -------- - ----- 115.zarate-0 a.t rroman 0 Q uno -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/cae3c3dc/attachment.html

From gus at ldeo.columbia.edu Wed Dec 7 15:30:50 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 7 Dec 2011 17:30:50 -0500
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu>
Message-ID: <64159374-27FD-466C-9C15-7F349896D8A3@ldeo.columbia.edu>

From the Torque Administrator Guide, version 3.0.3:

"By default, the node resource is mapped to a virtual node (that is, directly to a processor, not a full physical compute node). This behavior can be changed within Maui or Moab by setting the JOBNODEMATCHPOLICY parameter. See Appendix F of the Moab Workload Manager Administrator's Guide for more information."

Besides how 'node' is mapped, and the value of also specifying ppn= in your #PBS -l directive, do you remember what Glen Beane also told you about Maui's JOBNODEMATCHPOLICY?

The Torque Admin Guide is highly recommended reading! :)
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/index.php
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php#resources
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php#nodeExamples

Likewise for the Maui Admin Guide:
http://www.adaptivecomputing.com/resources/docs/maui/index.php

Cheers,
Gus Correa

On Dec 7, 2011, at 4:52 PM, Ricardo Román Brenes wrote:

> oops i missed that part! but it doenst work. It works on queue uno but not on queue dos (its the same behavoir as before with nodes=4).
>
> i thinking in reinstall the pbs_mom (aka recompiling...)
> > On Wed, Dec 7, 2011 at 3:50 PM, Gustavo Correa wrote: > Hi Ricardo > On Dec 7, 2011, at 4:41 PM, Ricardo Rom?n Brenes wrote: > > > yeah i have nodes=4, but if it works on queue uno why not on queue dos? > > > I don't know the answer. > it may have to do with defaults and or decisions that Torque may make for you > when you are not specific about the ppn in your PBS directives, but I don't really know. > Maybe the Torque developers have a hint. > > However, you didn't answer my previous question either: > > > Have you tried this? > > > > #PBS -q dos > > #PBS -l nodes=2:ppn=2 > > Did you try it? > > Gus Correa > > > > On Wed, Dec 7, 2011 at 3:37 PM, Gustavo Correa wrote: > > Hi Ricardo > > > > These two lines: > > > > > Resource_List.nodect = 4 > > > Resource_List.nodes = 4 > > > > suggest that you continue to ask for 4 nodes, like this: > > > > #PBS nodes=4 > > > > on queue 'dos'. > > Or am I mistaken? > > > > Have you tried this? > > > > #PBS -q dos > > #PBS -l nodes=2:ppn=2 > > > > As per a previous email, your pbs_server/nodes file lists only two nodes of type 'dos', not four. > > If so, you don't have the nodes that requested on the job script, > > which may explain why the job doesn't run. > > > > My two cents, > > Gus Correa > > > > > > On Dec 7, 2011, at 4:16 PM, Ricardo Rom?n Brenes wrote: > > > > > and.... 
> > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From roman.ricardo at gmail.com Wed Dec 7 15:43:03 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 7 Dec 2011 16:43:03 -0600 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: <64159374-27FD-466C-9C15-7F349896D8A3@ldeo.columbia.edu> References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> <64159374-27FD-466C-9C15-7F349896D8A3@ldeo.columbia.edu> Message-ID: yeah but that would solve the mystery about hte nodes=4 and nodes=2:ppn=2 but what can i do about hte queues... Ill try to recompile the pbs but... im not sensing that its going to that way... On Wed, Dec 7, 2011 at 4:30 PM, Gustavo Correa wrote: > >From the Torque Administrator Guide, version 3.0.3: > > "By default, the node resource is mapped to a virtual node (that is, > directly to a processor, not a full physical compute node). 
This behavior > can be changed within Maui or Moab by setting the JOBNODEMATCHPOLICY > parameter. See Appendix F of the Moab Workload Manager Administrator's > Guide for more information." > > Besides how 'node' is mapped, and the value of specifying also ppn= in > your #PBS -l directive, > do you remember what Glen Beane told you also about Maiu > JOBNODEMATCHPOLICY? > > The Torque Admin Guide is a highly recommended reading! :) > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/index.php > > > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php#resources > > > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php#nodeExamples > > Likewise for the Maui Admin Guide: > http://www.adaptivecomputing.com/resources/docs/maui/index.php > > Cheers, > Gus Correa > > On Dec 7, 2011, at 4:52 PM, Ricardo Rom?n Brenes wrote: > > > oops i missed that part! but it doenst work. It works on queue uno but > not on queue dos (its the same behavoir as before with nodes=4). > > > > i thinking in reinstall the pbs_mom (aka recompiling...) > > > > On Wed, Dec 7, 2011 at 3:50 PM, Gustavo Correa > wrote: > > Hi Ricardo > > On Dec 7, 2011, at 4:41 PM, Ricardo Rom?n Brenes wrote: > > > > > yeah i have nodes=4, but if it works on queue uno why not on queue dos? > > > > > I don't know the answer. > > it may have to do with defaults and or decisions that Torque may make > for you > > when you are not specific about the ppn in your PBS directives, but I > don't really know. > > Maybe the Torque developers have a hint. > > > > However, you didn't answer my previous question either: > > > > > Have you tried this? > > > > > > #PBS -q dos > > > #PBS -l nodes=2:ppn=2 > > > > Did you try it? 
> > > > Gus Correa > > > > > > > On Wed, Dec 7, 2011 at 3:37 PM, Gustavo Correa > wrote: > > > Hi Ricardo > > > > > > These two lines: > > > > > > > Resource_List.nodect = 4 > > > > Resource_List.nodes = 4 > > > > > > suggest that you continue to ask for 4 nodes, like this: > > > > > > #PBS nodes=4 > > > > > > on queue 'dos'. > > > Or am I mistaken? > > > > > > Have you tried this? > > > > > > #PBS -q dos > > > #PBS -l nodes=2:ppn=2 > > > > > > As per a previous email, your pbs_server/nodes file lists only two > nodes of type 'dos', not four. > > > If so, you don't have the nodes that requested on the job script, > > > which may explain why the job doesn't run. > > > > > > My two cents, > > > Gus Correa > > > > > > > > > On Dec 7, 2011, at 4:16 PM, Ricardo Rom?n Brenes wrote: > > > > > > > and.... > > > > > > > > [rroman at zarate-0:~/outputs]$ qstat -fl > > > > Job Id: 108.zarate-0 > > > > Job_Name = a.t > > > > Job_Owner = rroman at zarate-0 > > > > job_state = Q > > > > queue = dos > > > > server = zarate-0 > > > > Checkpoint = u > > > > ctime = Wed Dec 7 15:08:12 2011 > > > > Error_Path = zarate-0:/home/rroman/outputs/a.t.e108 > > > > Hold_Types = n > > > > Join_Path = n > > > > Keep_Files = n > > > > Mail_Points = a > > > > mtime = Wed Dec 7 15:23:30 2011 > > > > Output_Path = zarate-0:/home/rroman/outputs/a.t.o108 > > > > Priority = 0 > > > > qtime = Wed Dec 7 15:08:12 2011 > > > > Rerunable = True > > > > Resource_List.nodect = 4 > > > > Resource_List.nodes = 4 > > > > Resource_List.walltime = 01:00:00 > > > > Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8, > > > > PBS_O_LOGNAME=rroman, > > > > > PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/ > > > > > usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local > > > > /maui/bin:/usr/local/maui/sbin:/home/rroman/bin, > > > > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash, > > > > > PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0, > 
> > > PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos > > > > etime = Wed Dec 7 15:08:12 2011 > > > > exit_status = -3 > > > > submit_args = ../a.t -q dos -l nodes=4 > > > > start_time = Wed Dec 7 15:23:30 2011 > > > > Walltime.Remaining = 359 > > > > start_count = 918 > > > > fault_tolerant = False > > > > > > > > keeps saying exit_status -3 ! not sure what's happening here... > > > > _______________________________________________ > > > > torqueusers mailing list > > > > torqueusers at supercluster.org > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/fed94cfb/attachment-0001.html

From gus at ldeo.columbia.edu Wed Dec 7 15:47:32 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 7 Dec 2011 17:47:32 -0500
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> <58ACD013-F96D-4F16-A7C7-8CC703768AC4@ldeo.columbia.edu>
Message-ID: <84E1670A-16B3-44F0-AC79-25547FD19F4A@ldeo.columbia.edu>

What do you conclude from this? Is the problem now in queue uno? The intersection of the previous failures and this failure suggests that zarate-2 [and perhaps zarate-3 too] may have a problem.

But before you try to reinstall the whole thing, I would go back to a single queue, with all nodes of the same type [or no type]. I.e., server_priv/nodes:

zarate-0 np=2
zarate-1 np=2
zarate-2 np=2
zarate-3 np=2

and remove the attribute resources_default.neednodes from both queues [for test only], restart the pbs_server, and check whether jobs run on all nodes [say, with nodes=4:ppn=2].

This may confirm whether the problem is on some of the zarates or whether it resides in the Torque server and Maui scheduler setup.

Did you see the other email reminding you of Glen Beane's suggestion?
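A rough sketch of that single-queue test, written as a dry run. The helper name strip_props is made up for illustration, and the qmgr lines are only printed, not executed; the "unset queue" form is what I would expect to work, but please check it against the qmgr man page for your Torque version before running anything.

```shell
#!/bin/sh
# Hypothetical helper (not a Torque command): keep only the node name and
# the np= count from each line of a nodes file, dropping any properties.
strip_props() {
    awk 'NF >= 2 { print $1, $2 }' "$1"
}

# Scratch copy of the current layout; the live file is server_priv/nodes.
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
zarate-0 np=2 dos
zarate-1 np=2 dos
zarate-2 np=2 uno
zarate-3 np=2 uno
EOF

# Print the property-free listing for review.
strip_props "$scratch"
rm -f "$scratch"

# Dry run: print, rather than execute, the commands that would drop the
# per-queue default and submit the test job (assumed syntax, verify first).
for q in uno dos; do
    echo "qmgr -c 'unset queue $q resources_default.neednodes'"
done
echo "qsub -l nodes=4:ppn=2 ../a.t"
```

If the test job runs on all four nodes with the properties removed, the node daemons are probably fine and the queue-to-node mapping is the place to look.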
Good luck, Gus Correa On Dec 7, 2011, at 5:21 PM, Ricardo Rom?n Brenes wrote: > ok so, switched: > [root at zarate-0 ~]# pbsnodes > zarate-0 > state = free > np = 2 > properties = dos > ntype = cluster > status = opsys=linux,uname=Linux zarate-0 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=1601,nsessions=1,nusers=1,idletime=2695,totmem=736568kb,availmem=627116kb,physmem=212288kb,ncpus=2,loadave=0.14,gres=,netload=223300804,state=free,jobs=,varattr=,rectime=1323296933 > > zarate-1 > state = free > np = 2 > properties = dos > ntype = cluster > status = opsys=linux,uname=Linux zarate-1 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=102521,totmem=730040kb,availmem=648564kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=18457772,state=free,jobs=,varattr=,rectime=1323296935 > > zarate-2 > state = free > np = 2 > properties = uno > ntype = cluster > status = opsys=linux,uname=Linux zarate-2 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=102506,totmem=737208kb,availmem=661780kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=82050098,state=free,jobs=,varattr=,rectime=1323296935 > > zarate-3 > state = free > np = 2 > properties = uno > ntype = cluster > status = opsys=linux,uname=Linux zarate-3 2.6.29.4-167.fc11.ppc64 #1 SMP Wed May 27 17:18:17 EDT 2009 ppc64,sessions=? 0,nsessions=? 0,nusers=0,idletime=102505,totmem=737208kb,availmem=662512kb,physmem=212288kb,ncpus=2,loadave=0.00,gres=,netload=289492564,state=free,jobs=,varattr=,rectime=1323296935 > > > and... 
> > [rroman at zarate-0:~/outputs]$ qsub -l nodes=2:ppn=2 -q uno ../a.t > 115.zarate-0 > [rroman at zarate-0:~/outputs]$ qsub -l nodes=2:ppn=2 -q dos ../a.t > 116.zarate-0 > [rroman at zarate-0:~/outputs]$ qstat > Job id Name User Time Use S Queue > ------------------------- ---------------- --------------- -------- - ----- > 115.zarate-0 a.t rroman 0 Q uno > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Gareth.Williams at csiro.au Wed Dec 7 16:05:27 2011 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Thu, 8 Dec 2011 10:05:27 +1100 Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes : Suggested error message change. In-Reply-To: <242421BFAF465844BE24EB90BB97E221017DDFA5@ITSDAG3D.its.iastate.edu> References: <4EDAA6B7.1040106@gmail.com> <242421BFAF465844BE24EB90BB97E221017DDE2F@ITSDAG3D.its.iastate.edu> <007DECE986B47F4EABF823C1FBB19C620102C6360AC8@exvic-mbx04.nexus.csiro.au> <242421BFAF465844BE24EB90BB97E221017DDFA5@ITSDAG3D.its.iastate.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102C8732E45@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Coyle, James J [ITACD] [mailto:jjc at iastate.edu] > Sent: Wednesday, 7 December 2011 4:29 AM > To: Torque Users Mailing List > Subject: Re: [torqueusers] Beginner Problem: MSG=cannot locate feasible > nodes : Suggested error message change. > > > Gareth, > > Since you asked about whether the message is good, I'd recommend > a change in the message. Nice effort tying down the error handling code James! Ken, is it in Adaptive's interest to pick improving the error message up as a development request for at least this case? The priority is clearly not critical (the existing error message is useful if cryptic). If there were a bugzilla entry referencing this thread would it get acted on any time soon? 
Regards,

Gareth

>
> I've always thought that the np= syntax in the node file and
> from pbsnodes is inconsistent with the ppn= syntax in the
> qsub request.
>
>
> Note that in the checks in function
> static int proplist(
> ( In the 2.5.4 version which I am running, in the source
> file server/node_manager.c )
>
> when : is found in the nodes= portion of the job requirements,
> then if "=" is found in the string following it, then
> the function checks whether this is one of the
> "special properties" ppn, procs or gpus.
>
> 1) if "ppn", "procs", or "gpus" is found,
> the value after "=" is checked for a number (a positive integer);
> if one is not found, "1" is returned from proplist()
> 2) if none of the above special properties is found,
> again a "1" is returned from proplist()
>
> I suggest that the return from proplist should be
> different in these two cases,
> e.g. 255 when "xxx" is not recognized in "xxx=yyy",
> and perhaps 1, 2 and 3 respectively for the error returns from
> the ppn=, procs= and gpus= number checks.
>
> Then a more meaningful test could be performed on
> return from proplist. (I'd also store the offending
> string to show the user what was unacceptable.)
>
>
> E.g. for error return 255,
>
> Job requirement specification:
> nodes=2:xxx=27
> is not a valid request.
> "xxx" is not an acceptable special property,
> only ppn=, procs= and gpus= are acceptable here.
>
> For error return 3,
>
> Job requirement specification:
> nodes=2:gpus=yyy
> is not a valid request.
> String "yyy" after gpus= must be a positive
> integer.
>
>
> In general, I believe that error messages should let the user know
> 1) what is wrong in what they wrote, and
> 2) (if possible) how to change what they wrote into something
> acceptable.
>
>
>
> More specifically change the code segment
>
>    if (strcmp(pname, "ppn") == 0)
>      {
>      pequal++;
>
>      if ((number(&pequal, node_req) != 0) || (*pequal != '\0'))
>        {
>        return(1);
>        }
>      }
>    else if(strcmp(pname, "procs") == 0)
>      {
>      pequal++;
>      if ((number(&pequal, node_req) != 0) || (*pequal != '\0'))
>        {
>        return(1);
>        }
>      }
>    else if (strcmp(pname, "gpus") == 0)
>      {
>      pequal++;
>
>      if ((number(&pequal, gpu_req) != 0) || (*pequal != '\0'))
>        {
>        return(1);
>        }
>      }
>    else
>      {
>      return(1); /* not recognized - error */
>      }
>
> in server/node_manager.c to:
>
>    if (strcmp(pname, "ppn") == 0)
>      {
>      pequal++;
>
>      if ((number(&pequal, node_req) != 0) || (*pequal != '\0'))
>        {
>        return(1); /* ppn= number not recognized - error */
>        }
>      }
>    else if(strcmp(pname, "procs") == 0)
>      {
>      pequal++;
>      if ((number(&pequal, node_req) != 0) || (*pequal != '\0'))
>        {
>        return(2); /* procs= number not recognized - error */
>        }
>      }
>    else if (strcmp(pname, "gpus") == 0)
>      {
>      pequal++;
>
>      if ((number(&pequal, gpu_req) != 0) || (*pequal != '\0'))
>        {
>        return(3); /* gpus= number not recognized - error */
>        }
>      }
>    else
>      {
>      return(255); /* xxx= appears but xxx is not one of ppn, procs,
>                      or gpus - error */
>      }
>
>
>
> >-----Original Message-----
> >From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gareth.Williams at csiro.au
> >Sent: Tuesday, December 06, 2011 12:11 AM
> >To: torqueusers at supercluster.org
> >Subject: Re: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
> >
> >> -----Original Message-----
> >> From: Coyle, James J [ITACD] [mailto:jjc at iastate.edu]
> >> Sent: Tuesday, 6 December 2011 7:03 AM
> >> To: Torque Users Mailing List
> >> Subject: Re: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
> >>
> >> I think that you have a typo.
> >>
> >> Try using ppn=2 rather than npp=2
> >
> >You are also getting ncpus and procct set from the default_max
> >numbers.
> >This might be OK but might be problematic. I'd avoid
> >ncpus but procct is probably OK as I think it gets stripped from the
> >job as it is started anyway.
> >
> >All: is this a reasonable MSG? Would it be hard to make the feedback
> >more direct in this case? Is npp=2 in this context clearly an error
> >or could it be meaningful in a 'real' cluster configuration?
> >
> >Gareth
> >
> >>
> >> >-----Original Message-----
> >> >From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Clarke Earley
> >> >Sent: Saturday, December 03, 2011 4:46 PM
> >> >To: torqueusers at supercluster.org
> >> >Subject: [torqueusers] Beginner Problem: MSG=cannot locate feasible nodes
> >> >
> >> >I am in the process of setting up Torque and Maui on a 2 node cluster
> >> >running under Debian. Compilation and installation ran without any problems
> >> >and submission of simple test jobs ( $ echo "sleep 30" | qsub ) also ran
> >> >without any issue. However, when I try to specify multiple nodes, the
> >> >jobs fail as follows.
> >> > > >> > > $ echo "sleep 30" | qsub -l nodes=2:npp=2 -q batch > >> > > qsub: Job exceeds queue resource limits MSG=cannot > >> >satisfy > >> >queue max nodes requirement > >> > > $ echo "sleep 30" | qsub -l nodes=1:npp=2 -q batch > >> > > qsub: Job exceeds queue resource limits MSG=cannot > >> >locate > >> >feasible nodes (nodes file is empty or all systems are busy) > >> > > >> >The file server_priv/nodes file exists on the master node > >> >(thebrain): > >> > > $ cat /var/spool/torque/server_priv/nodes > >> > > thebrain np=12 > >> > > yakko np=12 > >> > > >> >and appears to be recognized by pbsnodes > >> > > $ pbsnodes > >> > > thebrain > >> > > state = free > >> > > np = 12 > >> > > ntype = cluster > >> > > status = > >> > >>rectime=1322943942,varattr=,jobs=,state=free,netload=905815039,gres > >= > >> > >>,loadave=0.00,ncpus=24,physmem=33004284kb,availmem=97029336kb,totme > >m > >> >=97457404kb,idletime=1692,nusers=2,nsessions=11,sessions=19903 > >> >24143 24149 24150 24151 24152 24896 24902 24903 24904 > >> >24905,uname=Linux > >> >chem 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 > >> >x86_64,opsys=linux > >> > > gpus = 0 > >> > > >> > > yakko > >> > > state = free > >> > > np = 12 > >> > > ntype = cluster > >> > > status = > >> > >>rectime=1322943940,varattr=,jobs=,state=free,netload=242901050,gres > >= > >> > >>,loadave=0.00,ncpus=24,physmem=33004284kb,availmem=56432072kb,totme > >m > >> >=56736504kb,idletime=3087,nusers=0,nsessions=? > >> >0,sessions=? 
0,uname=Linux yakko 3.0.0-1-amd64 #1 SMP Sat Aug 27 > >> >16:21:11 UTC 2011 x86_64,opsys=linux > >> > > gpus = 0 > >> > > >> >The output of qstat appears to indicate that resources are > >> >available: > >> > > $ qstat -Qf > >> > > Queue: batch > >> > > queue_type = Execution > >> > > total_jobs = 0 > >> > > state_count = Transit:0 Queued:0 Held:0 > >Waiting:0 > >> >Running:0 Exiting:0 > >> > > resources_max.ncpus = 4 > >> > > resources_max.nodes = 2 > >> > > resources_max.procct = 24 > >> > > resources_default.nodes = 1 > >> > > resources_default.walltime = 01:00:00 > >> > > mtime = 1322940014 > >> > > resources_available.ncpus = 4 > >> > > resources_available.nodes = 2 > >> > > resources_available.procct = 24 > >> > > resources_assigned.nodect = 0 > >> > > enabled = True > >> > > started = True > >> > > >> >I did not see anything in the log files that appeared helpful. > >Any > >> >suggestions would be most appreciated. Thank you in advance for > >> >your help. > >> > > >> > > >> >_______________________________________________ > >> >torqueusers mailing list > >> >torqueusers at supercluster.org > >> >http://www.supercluster.org/mailman/listinfo/torqueusers > > > >_______________________________________________ > >torqueusers mailing list > >torqueusers at supercluster.org > >http://www.supercluster.org/mailman/listinfo/torqueusers From stevenx.a.duchene at intel.com Wed Dec 7 16:23:17 2011 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Wed, 7 Dec 2011 15:23:17 -0800 Subject: [torqueusers] where is torque 4.0 beta code at? Message-ID: I just read through the 4.0 beta announcement posted on the Adaptive Computing web site but I do not see any mention of where the 4.0 beta code can be obtained. I also looked around in the Torque download link on the ADC web site as well with no luck. If one of the goals is to get people to test it then it has to be clear where to get the code. 
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111207/69395bed/attachment.html

From gus at ldeo.columbia.edu Wed Dec 7 16:37:05 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 7 Dec 2011 18:37:05 -0500
Subject: [torqueusers] maui wont run jobs from 1 of 2 queues
In-Reply-To:
References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> <64159374-27FD-466C-9C15-7F349896D8A3@ldeo.columbia.edu>
Message-ID:

On Dec 7, 2011, at 5:43 PM, Ricardo Román Brenes wrote:

> yeah but that would solve the mystery about the nodes=4 and nodes=2:ppn=2 but what can i do about the queues...
>
> I'll try to recompile the pbs but... I'm not sensing that it's going that way...
>

Neither am I.
I would try troubleshooting via Torque first.

Have you looked for clues in the pbs_server logs (under server_logs) and the maui logs (file maui.log) in zarate-0?

Have you looked at the system logs in the server and the nodes?
[In my vanilla CentOS Linux this is /var/log/messages, but I have no idea where it may be in the flavor of Linux that you run in PS3. Just curious, what Linux is it?]

There are tons of things that can potentially go wrong: permissions, authorization to use ports, etc.
Worth checking them out.

Gus Correa

> On Wed, Dec 7, 2011 at 4:30 PM, Gustavo Correa wrote:
> From the Torque Administrator Guide, version 3.0.3:
>
> "By default, the node resource is mapped to a virtual node (that is, directly to a processor, not a full physical compute node). This behavior can be changed within Maui or Moab by setting the JOBNODEMATCHPOLICY parameter. See Appendix F of the Moab Workload Manager Administrator's Guide for more information."
>
> Besides how 'node' is mapped, and the value of also specifying ppn= in your #PBS -l directive,
> do you remember what Glen Beane also told you about Maui JOBNODEMATCHPOLICY?
>
> The Torque Admin Guide is highly recommended reading! :)
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/index.php
>
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php#resources
>
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php#nodeExamples
>
> Likewise for the Maui Admin Guide:
> http://www.adaptivecomputing.com/resources/docs/maui/index.php
>
> Cheers,
> Gus Correa
>
> On Dec 7, 2011, at 4:52 PM, Ricardo Román Brenes wrote:
>
> > oops i missed that part! but it doesn't work. It works on queue uno but not on queue dos (it's the same behavior as before with nodes=4).
> >
> > I'm thinking of reinstalling the pbs_mom (aka recompiling...)
> >
> > On Wed, Dec 7, 2011 at 3:50 PM, Gustavo Correa wrote:
> > Hi Ricardo
> > On Dec 7, 2011, at 4:41 PM, Ricardo Román Brenes wrote:
> >
> > > yeah i have nodes=4, but if it works on queue uno why not on queue dos?
> > >
> > I don't know the answer.
> > It may have to do with defaults and/or decisions that Torque may make for you
> > when you are not specific about the ppn in your PBS directives, but I don't really know.
> > Maybe the Torque developers have a hint.
> >
> > However, you didn't answer my previous question either:
> >
> > > Have you tried this?
> > >
> > > #PBS -q dos
> > > #PBS -l nodes=2:ppn=2
> >
> > Did you try it?
> >
> > Gus Correa
> >
> >
> > > On Wed, Dec 7, 2011 at 3:37 PM, Gustavo Correa wrote:
> > > Hi Ricardo
> > >
> > > These two lines:
> > >
> > > > Resource_List.nodect = 4
> > > > Resource_List.nodes = 4
> > >
> > > suggest that you continue to ask for 4 nodes, like this:
> > >
> > > #PBS -l nodes=4
> > >
> > > on queue 'dos'.
> > > Or am I mistaken?
> > >
> > > Have you tried this?
> > >
> > > #PBS -q dos
> > > #PBS -l nodes=2:ppn=2
> > >
> > > As per a previous email, your pbs_server/nodes file lists only two nodes of type 'dos', not four.
> > > If so, you don't have the nodes that you requested in the job script,
> > > which may explain why the job doesn't run.
> > >
> > > My two cents,
> > > Gus Correa
> > >
> > >
> > > On Dec 7, 2011, at 4:16 PM, Ricardo Román Brenes wrote:
> > >
> > > > and....
> > > >
> > > > [rroman at zarate-0:~/outputs]$ qstat -fl
> > > > Job Id: 108.zarate-0
> > > >     Job_Name = a.t
> > > >     Job_Owner = rroman at zarate-0
> > > >     job_state = Q
> > > >     queue = dos
> > > >     server = zarate-0
> > > >     Checkpoint = u
> > > >     ctime = Wed Dec 7 15:08:12 2011
> > > >     Error_Path = zarate-0:/home/rroman/outputs/a.t.e108
> > > >     Hold_Types = n
> > > >     Join_Path = n
> > > >     Keep_Files = n
> > > >     Mail_Points = a
> > > >     mtime = Wed Dec 7 15:23:30 2011
> > > >     Output_Path = zarate-0:/home/rroman/outputs/a.t.o108
> > > >     Priority = 0
> > > >     qtime = Wed Dec 7 15:08:12 2011
> > > >     Rerunable = True
> > > >     Resource_List.nodect = 4
> > > >     Resource_List.nodes = 4
> > > >     Resource_List.walltime = 01:00:00
> > > >     Variable_List = PBS_O_HOME=/home/rroman,PBS_O_LANG=en_US.utf8,
> > > >         PBS_O_LOGNAME=rroman,
> > > >         PBS_O_PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/
> > > >         usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local
> > > >         /maui/bin:/usr/local/maui/sbin:/home/rroman/bin,
> > > >         PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash,
> > > >         PBS_O_TZ=America/Costa_Rica,PBS_O_HOST=zarate-0,PBS_SERVER=zarate-0,
> > > >         PBS_O_WORKDIR=/home/rroman/outputs,PBS_O_QUEUE=dos
> > > >     etime = Wed Dec 7 15:08:12 2011
> > > >     exit_status = -3
> > > >     submit_args = ../a.t -q dos -l nodes=4
> > > >     start_time = Wed Dec 7 15:23:30 2011
> > > >     Walltime.Remaining = 359
> > > >     start_count = 918
> > > >     fault_tolerant = False
> > > >
> > > > keeps saying exit_status -3 ! not sure what's happening here...

From dbeer at adaptivecomputing.com Wed Dec 7 16:44:21 2011
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 07 Dec 2011 16:44:21 -0700 (MST)
Subject: [torqueusers] where is torque 4.0 beta code at?
In-Reply-To:
Message-ID:

Steve,

It appears that announcement was a bit premature. We are targeting a December 22nd release of TORQUE 4.0 beta, as there are still some bugs we're trying to track down. However, things are promising and if anyone would like to get the latest code, feel free to check out the svn:

svn co svn://svn.clusterresources.com/torque/trunk

Once the beta code is released, expect an email both to the users and developers lists.
Sorry for the confusion, David ----- Original Message ----- > > > > > I just read through the 4.0 beta announcement posted on the Adaptive > Computing web site but I do not see any mention of where the 4.0 > beta code can be obtained. I also looked around in the Torque > download link on the ADC web site as well with no luck. > > > > If one of the goals is to get people to test it then it has to be > clear where to get the code. > > -- > > Steven DuChene > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From knielson at adaptivecomputing.com Wed Dec 7 16:45:18 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 07 Dec 2011 16:45:18 -0700 (MST) Subject: [torqueusers] where is torque 4.0 beta code at? In-Reply-To: Message-ID: ----- Original Message ----- > From: "StevenX A DuChene" > To: torqueusers at supercluster.org > Sent: Wednesday, December 7, 2011 4:23:17 PM > Subject: [torqueusers] where is torque 4.0 beta code at? > > > > > > I just read through the 4.0 beta announcement posted on the Adaptive > Computing web site but I do not see any mention of where the 4.0 > beta code can be obtained. I also looked around in the Torque > download link on the ADC web site as well with no luck. > > > > If one of the goals is to get people to test it then it has to be > clear where to get the code. > > -- > > Steven DuChene Steven, We originally were hoping to have the beta out by the end of November. But we still have some issues to clear up before we can call it a beta. You can download the code using subversion by calling svn co svn://clusterresources.com/torque/trunk. 
Regards Ken From samuel at unimelb.edu.au Wed Dec 7 19:46:02 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Thu, 08 Dec 2011 13:46:02 +1100 Subject: [torqueusers] My issue when changing the nodes list for a queue. In-Reply-To: References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn> <4EDDCCE4.8080602@yahoo.com.cn> <4EDEDC91.2050007@yahoo.com.cn> Message-ID: <4EE024EA.5080403@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 08/12/11 04:34, Gustavo Correa wrote: > Turning on SMT at run time was nice, > but I don't think this is [yet] feasible with hyperthreading in Linux. Depends on the architecture, you can do this with Linux on Power. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk7gJOoACgkQO2KABBYQAh8p/gCeNMrMv9Jvk3sGLWXarGNNygit HNsAoIEtubUrJplMcxiBuopTbmPQNNbr =gBXA -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Wed Dec 7 19:47:14 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Thu, 08 Dec 2011 13:47:14 +1100 Subject: [torqueusers] max jobs per group per node In-Reply-To: References: Message-ID: <4EE02532.8080300@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 08/12/11 02:06, Govind B. Songara wrote: > I am looking an option in torque/maui to configure max jobs per group > per node. I'm not sure that Maui can do those sorts of multi-dimensional restrictions.. Might be a question for the maui users list. cheers! 
Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk7gJTIACgkQO2KABBYQAh/nawCdForatEDIM6x7l2zQlUN05SG+ ZwsAn2HKsVQcJkfH8I0E7/5kbdhwYBqz =QSd1 -----END PGP SIGNATURE----- From tiago.silva at cefas.co.uk Thu Dec 8 04:50:23 2011 From: tiago.silva at cefas.co.uk (Tiago Silva (Cefas)) Date: Thu, 8 Dec 2011 11:50:23 -0000 Subject: [torqueusers] hydra *and* mpd In-Reply-To: References: <04A370231C10664C88B28D1EF74F487903360BC8@LOWEXPRESS.corp.cefas.co.uk><04A370231C10664C88B28D1EF74F487903360BCF@LOWEXPRESS.corp.cefas.co.uk><04A370231C10664C88B28D1EF74F487903360BD5@LOWEXPRESS.corp.cefas.co.uk> Message-ID: <04A370231C10664C88B28D1EF74F487903360BD7@LOWEXPRESS.corp.cefas.co.uk> Thanks for the useful advice. We are using mpd on this second version of mpich2 beacause at the time we couldn't get it to work with hydra support. We will have to revisit that. In terms of torque and oscmpiexec how do I set it up for two different mpich2 installations? Thanks, tiago -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gustavo Correa Sent: 07 December 2011 17:42 To: Torque Users Mailing List Subject: Re: [torqueusers] hydra *and* mpd Hi Tiago Use OSC mpiexec alone. It works beautifully with Torque. No need for mpd in this case, and it may actually only mess things up. I would shut down all mpd rings that may be there, and forget about mpd. Actually, mpd was deprecated by the MPICH2 development team, AFAIK. Make sure you point to the OSC mpiexec in the command line, say, by using full path name to it, to avoid the risk of mistakenly using the mpiexec that came with MPICH2. 
My two cents, Gus Correa On Dec 7, 2011, at 4:30 AM, Tiago Silva (Cefas) wrote: > I have read about how to submit jobs using OSC's mpiexec and with mpd. > Simple question, can I mix them, and all the jobs will all be accounted for by torque? > > tiago > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tiago Silva (Cefas) > Sent: 06 December 2011 14:30 > To: torqueusers at supercluster.org > Subject: [torqueusers] hydra *and* mpd > > Hi > > We have a 20 node cluster with rocks 5.3 and mpich2 1.3.1. Most users use mpiexec with hydra, but one of our models requires a second version of mpich that was compiled differently and we submit these jobs using mpirun and an mpd ring. > > We have rocks 5.3 but due to lack of foresight torque wasn't installed when the cluster was built. We are planning to install torque on top of the existing installation. Would torque be able to handle two different initialization methods (mpirun and mpiexec)? > > Thanks, > Tiago > > > > > > > > This email and any attachments are intended for the named recipient only. Its unauthorised use, distribution, disclosure, storage or copying is not permitted. If you have received it in error, please destroy all copies and notify the sender. In messages of a non-business nature, the views and opinions expressed are the author's own and do not necessarily reflect those of Cefas. Communications on Cefas' computer systems may be monitored and/or recorded to secure the effective operation of the system and for other lawful purposes. > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers This email and any attachments are intended for the named recipient only. 
Its unauthorised use, distribution, disclosure, storage or copying is not permitted. If you have received it in error, please destroy all copies and notify the sender. In messages of a non-business nature, the views and opinions expressed are the author's own and do not necessarily reflect those of Cefas. Communications on Cefas' computer systems may be monitored and/or recorded to secure the effective operation of the system and for other lawful purposes.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111208/7f31f64f/attachment.html

From tiago.silva at cefas.co.uk Thu Dec 8 04:51:21 2011
From: tiago.silva at cefas.co.uk (Tiago Silva (Cefas))
Date: Thu, 8 Dec 2011 11:51:21 -0000
Subject: [torqueusers] hydra *and* mpd
In-Reply-To: <04A370231C10664C88B28D1EF74F487903360BD7@LOWEXPRESS.corp.cefas.co.uk>
References: <04A370231C10664C88B28D1EF74F487903360BC8@LOWEXPRESS.corp.cefas.co.uk><04A370231C10664C88B28D1EF74F487903360BCF@LOWEXPRESS.corp.cefas.co.uk><04A370231C10664C88B28D1EF74F487903360BD5@LOWEXPRESS.corp.cefas.co.uk> <04A370231C10664C88B28D1EF74F487903360BD7@LOWEXPRESS.corp.cefas.co.uk>
Message-ID: <04A370231C10664C88B28D1EF74F487903360BD8@LOWEXPRESS.corp.cefas.co.uk>

Or rather, is it feasible? If so I will look up how to do it.

tiago

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tiago Silva (Cefas)
Sent: 08 December 2011 11:50
To: Torque Users Mailing List
Subject: Re: [torqueusers] hydra *and* mpd

Thanks for the useful advice. We are using mpd on this second version of mpich2 because at the time we couldn't get it to work with hydra support. We will have to revisit that.

In terms of torque and OSC mpiexec, how do I set it up for two different mpich2 installations?
Thanks, tiago -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gustavo Correa Sent: 07 December 2011 17:42 To: Torque Users Mailing List Subject: Re: [torqueusers] hydra *and* mpd Hi Tiago Use OSC mpiexec alone. It works beautifully with Torque. No need for mpd in this case, and it may actually only mess things up. I would shut down all mpd rings that may be there, and forget about mpd. Actually, mpd was deprecated by the MPICH2 development team, AFAIK. Make sure you point to the OSC mpiexec in the command line, say, by using full path name to it, to avoid the risk of mistakenly using the mpiexec that came with MPICH2. My two cents, Gus Correa On Dec 7, 2011, at 4:30 AM, Tiago Silva (Cefas) wrote: > I have read about how to submit jobs using OSC's mpiexec and with mpd. > Simple question, can I mix them, and all the jobs will all be accounted for by torque? > > tiago > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tiago Silva (Cefas) > Sent: 06 December 2011 14:30 > To: torqueusers at supercluster.org > Subject: [torqueusers] hydra *and* mpd > > Hi > > We have a 20 node cluster with rocks 5.3 and mpich2 1.3.1. Most users use mpiexec with hydra, but one of our models requires a second version of mpich that was compiled differently and we submit these jobs using mpirun and an mpd ring. > > We have rocks 5.3 but due to lack of foresight torque wasn't installed when the cluster was built. We are planning to install torque on top of the existing installation. Would torque be able to handle two different initialization methods (mpirun and mpiexec)? > > Thanks, > Tiago > > > > > > > > This email and any attachments are intended for the named recipient only. Its unauthorised use, distribution, disclosure, storage or copying is not permitted. If you have received it in error, please destroy all copies and notify the sender. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111208/71619762/attachment-0001.html

From jjc at iastate.edu Thu Dec 8 09:19:31 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 8 Dec 2011 16:19:31 +0000
Subject: [torqueusers] My issue when changing the nodes list for a queue.
In-Reply-To:
References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn> <4EDDCCE4.8080602@yahoo.com.cn> <4EDEDC91.2050007@yahoo.com.cn>
Message-ID: <242421BFAF465844BE24EB90BB97E221017DE65E@ITSDAG3D.its.iastate.edu>

Hongsheng,

For Intel processors which support HyperThreading, I generally set np=P, the total number of physical cores, but turn hyper-threading on. I generally saw about a 10% improvement for most jobs on a single core Pentium 4 Xeon, which I attributed mostly to lower OS overhead.

Reasoning:
---------
Intel processors which are capable of hyper-threading have two instruction counters and two sets of registers per core, so that they can support two independent streams of instructions simultaneously. Much of the time, the instructions must take turns, an instruction from one thread executing on even clock cycles and from the other on the odd clock cycles.

Why not set np=2P?
I don't set np=2P because my users' codes are MPI, and are all executing the same mix of instructions, so they rarely can execute at the same time, so I'd end up using half the nodes, but the codes would run half as fast.

Why use hyperthreading then?
The OS has to also run some small percentage of the time, and if I am running P MPI processes on a node, the OS will need to interrupt one of the user's MPI processes to execute. With Hyperthreading no interruption/context switch is needed.
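[Sketch, not part of the original message.] To find the P used for np=P above, one option is to count distinct (physical id, core id) pairs in /proc/cpuinfo, since hyperthreaded siblings report the same pair. A minimal sketch, assuming the usual x86 Linux layout in which each processor entry lists "physical id" before "core id"; the function name count_physical_cores() and the MAX_CORES bound are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CORES 1024  /* arbitrary upper bound for this sketch */

/*
 * Count distinct (physical id, core id) pairs in text laid out like
 * Linux /proc/cpuinfo. Hyperthreaded siblings share the same pair, so the
 * result is the number of physical cores, i.e. the "P" in np=P.
 * Returns 0 if the text carries no "core id" lines.
 */
int count_physical_cores(const char *cpuinfo)
{
    int phys_seen[MAX_CORES];
    int core_seen[MAX_CORES];
    int nseen = 0;
    int phys = 0;               /* socket of the current processor entry */
    const char *line = cpuinfo;

    assert(cpuinfo != NULL);

    while (*line != '\0') {
        int value;
        const char *newline;

        /* A space in the format matches any run of blanks or tabs. */
        if (sscanf(line, "physical id : %d", &value) == 1) {
            phys = value;
        } else if (sscanf(line, "core id : %d", &value) == 1) {
            int i;
            int known = 0;

            for (i = 0; i < nseen; i++) {
                if (phys_seen[i] == phys && core_seen[i] == value) {
                    known = 1;  /* sibling thread of a core already counted */
                    break;
                }
            }
            if (!known && nseen < MAX_CORES) {
                phys_seen[nseen] = phys;
                core_seen[nseen] = value;
                nseen++;
            }
        }

        /* Advance to the start of the next line, if any. */
        newline = strchr(line, '\n');
        if (newline == NULL)
            break;
        line = newline + 1;
    }
    return nseen;
}
```

A caller would read /proc/cpuinfo into a buffer and pass it in; the result is the value one might put after np= in server_priv/nodes when following the np=P convention described above.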

Pros and Cons of HyperThreading in my case:

Pros:
1) The OS does not have to interrupt user processes and do a context switch on the way in and then on the way back out to get its work done. (The program's context is the instruction counter and registers which define the state of execution for a thread/process.) This is due to the fact that HT maintains 2P states, so you can easily maintain P user states plus one state for the OS.
2) My users are scientists who mainly use the floating point units of the cores, while the OS mainly uses integers and pointers. Since these are separate units within the core, and can be used simultaneously, hyperthreading can do two things at the same time.

Cons:
1) HT increases cache misses: Hyperthreading (at least in the Pentium 4 Xeon) turned off the Level-1 instruction cache, and halved the level 2 cache (each virtual processor gets half), I believe by halving the cache associativity.
2) There is slightly more power use due to more of the chip being active at the same time.

Background: (From memory, so specific details about the RISC core may not be exact.)
In the Pentium 4 Xeon, assembly/machine instructions were decoded into several micro-instructions which are executed by the RISC core. I believe that there were 7 micro-instruction queues but the processor could execute at most 5 micro-instructions simultaneously due to shared resources.
The instruction streams from each processor are interleaved, one from thread 1, one from thread 2. This alone helps because the two threads appear to be running on a processor which is running at half the speed, but the memory latency is half the number of clock cycles. In addition, after the interleaving, the instructions are allowed to "slide down" in the instruction queue if there are no other instructions ahead of them.
This helps when the machine instructions for the two threads decompose into micro-instructions which utilize independent parts of the RISC core, and so can execute simultaneously, increasing the processor performance. James Coyle, PhD High Performance Computing Group 115 Durham Center Iowa State Univ. phone: (515)-294-2099 Ames, Iowa 50011 web: http://jjc.public.iastate.edu/ >-----Original Message----- >From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >bounces at supercluster.org] On Behalf Of Gustavo Correa >Sent: Wednesday, December 07, 2011 11:34 AM >To: Torque Users Mailing List >Subject: Re: [torqueusers] My issue when changing the nodes list for >a queue. > > >On Dec 6, 2011, at 10:25 PM, Hongsheng Zhao wrote: > >> On 12/06/2011 09:02 PM, Jan Kasiak wrote: >>> Hi, >>> >>> Thats not actually true. I think half the cores are >hyperthreaded. Look >>> up your processor on Intel's website for exact core count. >cpuinfo >>> reports logical and not physical cores. It also looks like you >have a >>> dual socket node. >>> >> >> Do you mean the np should be equal to the number of the physical >cores >> for specific node? > >Hi Hongsheng > >As far as I know, this is debatable. >Some parallel programs don't scale well when hyperthreading is >turned on, >but other programs do well. >Scaling is seldom what you would expect with physical cores, i.e., >if you have >8 physical cores and it appears as 16 because of hyperthreading, >when >you switch your program from using 8 to 16 processes, the speedup is >not >a factor of 2, but often significantly less. >However, a factor of 1.2 may still be a very good thing. > >There is some hassle to manage the situation in a hyperthreaded node >when several >jobs share the node. So, in this case, to make things easy >from the management standpoint at least, >you may choose to turn off hyperthreading, which is typically done >on the BIOS settings. 
> >Two years ago I ran some jobs on an IBM machine >where hyperthreading [which IBM calls symmetric multithreading or >SMT] >could be turned on or off at runtime by the MPI job. Each node had >32 physical cores, >that could look like 64 by just setting an environment variable in >the job script. >Speedups in the specific program I ran [a climate general >circulation model] when >going from 32 physical cores to 64 SMT cores was more like 1.2-1.3 >than 2. >However, this was still very good, specially considering that the >lab charges were based >on full node utilization per hour, regardless of SMT being on or >off. > >Turning on SMT at run time was nice, >but I don't think this is [yet] feasible with hyperthreading in >Linux. >[I may be wrong about this, and somebody more knowledgeable on these >matters >in the list could clarify the point.] > >My two cents, >Gus Correa > > >> >> Regards >> -- >> Hongsheng Zhao >> School of Physics and Electrical Information Science, >> Ningxia University, Yinchuan 750021, China >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From leggett at mcs.anl.gov Thu Dec 8 09:36:59 2011 From: leggett at mcs.anl.gov (Ti Leggett) Date: Thu, 8 Dec 2011 10:36:59 -0600 Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting Message-ID: I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs keep randomly segfaulting and dying. 
I see this in the MOM log right before dying: 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit And something similar to this in dmesg: pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 163 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111208/9746a21c/attachment.bin From gus at ldeo.columbia.edu Thu Dec 8 09:39:49 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Thu, 8 Dec 2011 11:39:49 -0500 Subject: [torqueusers] My issue when changing the nodes list for a queue. In-Reply-To: <4EE024EA.5080403@unimelb.edu.au> References: <4ED85F47.7090005@yahoo.com.cn> <4ED86618.7090105@unimelb.edu.au> <4ED87EF9.7070306@yahoo.com.cn> <6ACCD0EB-6813-483C-A394-5249E819F85E@ldeo.columbia.edu> <4ED9AEF5.8080709@yahoo.com.cn> <5EB0A5E4-80C2-4D68-BA17-36CB2C81C1C8@ldeo.columbia.edu> <4EDDAA57.4090308@yahoo.com.cn> <4EDDCCE4.8080602@yahoo.com.cn> <4EDEDC91.2050007@yahoo.com.cn> <4EE024EA.5080403@unimelb.edu.au> Message-ID: <1B1C9BA8-E25D-4BE1-BCB5-FA5DF437AA6B@ldeo.columbia.edu> Hi Chris Thank you for your insightful comments! I only wish you wrote a bit more detail about it. Hongsheng's questions about Torque setup are going a long way ... :) My experience with turning on/off SMT at runtime was on an IBM computer running AIX, not Linux, but indeed the processors were PowerPC. Any chances that at some point this "runtime on/off hyperthreading" feature will become available in hyperthreaded Intel processors in Linux? You say this is architecture dependent. If anything, what prevents Intel Nehalem, Westmere and later generations to do so? 
Regards, Gus Correa On Dec 7, 2011, at 9:46 PM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 08/12/11 04:34, Gustavo Correa wrote: > >> Turning on SMT at run time was nice, >> but I don't think this is [yet] feasible with hyperthreading in Linux. > > Depends on the architecture, you can do this with Linux on Power. > > cheers, > Chris > - -- > Christopher Samuel - Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.unimelb.edu.au/ > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk7gJOoACgkQO2KABBYQAh8p/gCeNMrMv9Jvk3sGLWXarGNNygit > HNsAoIEtubUrJplMcxiBuopTbmPQNNbr > =gBXA > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Thu Dec 8 09:50:04 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Thu, 8 Dec 2011 11:50:04 -0500 Subject: [torqueusers] hydra *and* mpd In-Reply-To: <04A370231C10664C88B28D1EF74F487903360BD7@LOWEXPRESS.corp.cefas.co.uk> References: <04A370231C10664C88B28D1EF74F487903360BC8@LOWEXPRESS.corp.cefas.co.uk><04A370231C10664C88B28D1EF74F487903360BCF@LOWEXPRESS.corp.cefas.co.uk><04A370231C10664C88B28D1EF74F487903360BD5@LOWEXPRESS.corp.cefas.co.uk> <04A370231C10664C88B28D1EF74F487903360BD7@LOWEXPRESS.corp.cefas.co.uk> Message-ID: H[O]i Tiago In my experience, you can use the same OSC mpiexec to launch programs compiled with different versions of MPICH2. http://www.osc.edu/~djohnson/mpiexec/index.php Note that OSC mpiexec installation is completely independent from MPICH2. Therefore, you need to be careful enough to point to the OSC mpiexec, not to the MPICH2 mpiexec, when you launch MPI programs. 
However, you will continue to compile the MPI programs with MPICH2 mpicc, mpif90, etc. This can be achieved in several ways, the simplest one being to use full path name to OSC mpiexec. You could also put it on the top of your PATH in your .profile/.bashrc or .[t]cshrc file, which is quite simple also, and perhaps a bit cleaner. You could also use the environment module package [Evironment module is a neat solution, but takes some work to setup, and may only be justified if you have a cluster with many users, bunch of software versions that people want to interchange, etc.] http://modules.sourceforge.net/ I hope this helps, Gus Correa On Dec 8, 2011, at 6:50 AM, Tiago Silva (Cefas) wrote: > Thanks for the useful advice. We are using mpd on this second version of > mpich2 beacause at the time we couldn't get it to work with hydra > support. We will have to revisit that. > > In terms of torque and oscmpiexec how do I set it up for two different > mpich2 installations? > > Thanks, > tiago > > > -----Original Message----- > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gustavo > Correa > Sent: 07 December 2011 17:42 > To: Torque Users Mailing List > Subject: Re: [torqueusers] hydra *and* mpd > > Hi Tiago > > Use OSC mpiexec alone. It works beautifully with Torque. > No need for mpd in this case, and it may actually only mess things up. > I would shut down all mpd rings that may be there, and forget about mpd. > Actually, mpd was deprecated by the MPICH2 development team, AFAIK. > > Make sure you point to the OSC mpiexec in the command line, say, > by using full path name to it, to avoid the risk of mistakenly using the > > mpiexec that came with MPICH2. > > My two cents, > Gus Correa > > On Dec 7, 2011, at 4:30 AM, Tiago Silva (Cefas) wrote: > > > I have read about how to submit jobs using OSC's mpiexec and with mpd. 
> > Simple question, can I mix them, and all the jobs will all be > accounted for by torque? > > > > tiago > > > > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tiago Silva > (Cefas) > > Sent: 06 December 2011 14:30 > > To: torqueusers at supercluster.org > > Subject: [torqueusers] hydra *and* mpd > > > > Hi > > > > We have a 20 node cluster with rocks 5.3 and mpich2 1.3.1. Most users > use mpiexec with hydra, but one of our models requires a second version > of mpich that was compiled differently and we submit these jobs using > mpirun and an mpd ring. > > > > We have rocks 5.3 but due to lack of foresight torque wasn't installed > when the cluster was built. We are planning to install torque on top of > the existing installation. Would torque be able to handle two different > initialization methods (mpirun and mpiexec)? > > > > Thanks, > > Tiago > > > > > > > > > > > > > > > > This email and any attachments are intended for the named recipient > only. Its unauthorised use, distribution, disclosure, storage or copying > is not permitted. If you have received it in error, please destroy all > copies and notify the sender. In messages of a non-business nature, the > views and opinions expressed are the author's own and do not necessarily > reflect those of Cefas. Communications on Cefas' computer systems may be > monitored and/or recorded to secure the effective operation of the > system and for other lawful purposes. > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > This email and any attachments are intended for the named recipient only. 
Its unauthorised use, distribution, disclosure, storage or copying is not permitted. If you have received it in error, please destroy all copies and notify the sender. In messages of a non-business nature, the views and opinions expressed are the author's own and do not necessarily reflect those of Cefas. Communications on Cefas? computer systems may be monitored and/or recorded to secure the effective operation of the system and for other lawful purposes. > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Thu Dec 8 09:51:52 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Thu, 8 Dec 2011 11:51:52 -0500 Subject: [torqueusers] hydra *and* mpd In-Reply-To: <04A370231C10664C88B28D1EF74F487903360BD8@LOWEXPRESS.corp.cefas.co.uk> References: <04A370231C10664C88B28D1EF74F487903360BC8@LOWEXPRESS.corp.cefas.co.uk><04A370231C10664C88B28D1EF74F487903360BCF@LOWEXPRESS.corp.cefas.co.uk><04A370231C10664C88B28D1EF74F487903360BD5@LOWEXPRESS.corp.cefas.co.uk> <04A370231C10664C88B28D1EF74F487903360BD7@LOWEXPRESS.corp.cefas.co.uk> <04A370231C10664C88B28D1EF74F487903360BD8@LOWEXPRESS.corp.cefas.co.uk> Message-ID: <594E59E2-42C3-49D5-8D0E-96E64C4D479A@ldeo.columbia.edu> H[O}i Tiago Yes, feasilble, see my previous email. If using OSC mpiexec, turn off mpd completely, shutting down any mpd rings that you may have established. I hope this helps, Gus Correa On Dec 8, 2011, at 6:51 AM, Tiago Silva (Cefas) wrote: > Or rather, it is feasible? If so I will look up how to do it. > > tiago > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tiago Silva (Cefas) > Sent: 08 December 2011 11:50 > To: Torque Users Mailing List > Subject: Re: [torqueusers] hydra *and* mpd > > Thanks for the useful advice. 
We are using mpd on this second version of > mpich2 beacause at the time we couldn't get it to work with hydra > support. We will have to revisit that. > > In terms of torque and oscmpiexec how do I set it up for two different > mpich2 installations? > > Thanks, > tiago > > > -----Original Message----- > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gustavo > Correa > Sent: 07 December 2011 17:42 > To: Torque Users Mailing List > Subject: Re: [torqueusers] hydra *and* mpd > > Hi Tiago > > Use OSC mpiexec alone. It works beautifully with Torque. > No need for mpd in this case, and it may actually only mess things up. > I would shut down all mpd rings that may be there, and forget about mpd. > Actually, mpd was deprecated by the MPICH2 development team, AFAIK. > > Make sure you point to the OSC mpiexec in the command line, say, > by using full path name to it, to avoid the risk of mistakenly using the > > mpiexec that came with MPICH2. > > My two cents, > Gus Correa > > On Dec 7, 2011, at 4:30 AM, Tiago Silva (Cefas) wrote: > > > I have read about how to submit jobs using OSC's mpiexec and with mpd. > > Simple question, can I mix them, and all the jobs will all be > accounted for by torque? > > > > tiago > > > > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tiago Silva > (Cefas) > > Sent: 06 December 2011 14:30 > > To: torqueusers at supercluster.org > > Subject: [torqueusers] hydra *and* mpd > > > > Hi > > > > We have a 20 node cluster with rocks 5.3 and mpich2 1.3.1. Most users > use mpiexec with hydra, but one of our models requires a second version > of mpich that was compiled differently and we submit these jobs using > mpirun and an mpd ring. > > > > We have rocks 5.3 but due to lack of foresight torque wasn't installed > when the cluster was built. We are planning to install torque on top of > the existing installation. 
Would torque be able to handle two different > initialization methods (mpirun and mpiexec)? > > > > Thanks, > > Tiago > > > > > > > > > > > > > > > > This email and any attachments are intended for the named recipient > only. Its unauthorised use, distribution, disclosure, storage or copying > is not permitted. If you have received it in error, please destroy all > copies and notify the sender. In messages of a non-business nature, the > views and opinions expressed are the author's own and do not necessarily > reflect those of Cefas. Communications on Cefas' computer systems may be > monitored and/or recorded to secure the effective operation of the > system and for other lawful purposes. > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > This email and any attachments are intended for the named recipient only. Its unauthorised use, distribution, disclosure, storage or copying is not permitted. If you have received it in error, please destroy all copies and notify the sender. In messages of a non-business nature, the views and opinions expressed are the author's own and do not necessarily reflect those of Cefas. Communications on Cefas? computer systems may be monitored and/or recorded to secure the effective operation of the system and for other lawful purposes. > > > > > > > This email and any attachments are intended for the named recipient only. Its unauthorised use, distribution, disclosure, storage or copying is not permitted. If you have received it in error, please destroy all copies and notify the sender. In messages of a non-business nature, the views and opinions expressed are the author's own and do not necessarily reflect those of Cefas. 
Communications on Cefas? computer systems may be monitored and/or recorded to secure the effective operation of the system and for other lawful purposes. > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From glen.beane at gmail.com Thu Dec 8 09:57:36 2011 From: glen.beane at gmail.com (Glen Beane) Date: Thu, 8 Dec 2011 11:57:36 -0500 Subject: [torqueusers] different config shows in root and users accounts In-Reply-To: References: Message-ID: On Wed, Nov 30, 2011 at 3:15 PM, Ricardo Rom?n Brenes wrote: > hi guys. > > So i have this Torque running and when i print the server as root i get: > > [root at zarate-0 ~]# qmgr -c "p s" > # > # Create queues and set their attributes. > # > # > # Create and define queue uno > # > create queue uno > set queue uno queue_type = Execution > set queue uno acl_host_enable = False > set queue uno acl_hosts = zarate-1 > set queue uno acl_hosts += zarate-0 > set queue uno resources_default.neednodes = uno?? ? ? ?<---------------- > THIS LINE > set queue uno resources_default.nodes = 1:uno > set queue uno enabled = True > set queue uno started = True > # > # Create and define queue dos > # > create queue dos > set queue dos queue_type = Execution > set queue dos acl_host_enable = False > set queue dos acl_hosts = zarate-3 > set queue dos acl_hosts += zarate-2 > set queue dos resources_default.neednodes = dos ? ? ? ?<---------------- AND > THIS LINE > set queue dos resources_default.nodes = 1:dos > set queue dos enabled = True > set queue dos started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = zarate-0 > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server next_job_number = 37 > > > > but when it as a user i get: > > [rroman at zarate-0:~/outputs]$ qmgr -c "p s" > # > # Create queues and set their attributes. > # > # > # Create and define queue uno > # > create queue uno > set queue uno queue_type = Execution > set queue uno acl_host_enable = False > set queue uno acl_hosts = zarate-1 > set queue uno acl_hosts += zarate-0 > set queue uno resources_default.nodes = 1:uno > set queue uno enabled = True > set queue uno started = True > # > # Create and define queue dos > # > create queue dos > set queue dos queue_type = Execution > set queue dos acl_host_enable = False > set queue dos acl_hosts = zarate-3 > set queue dos acl_hosts += zarate-2 > set queue dos resources_default.nodes = 1:dos > set queue dos enabled = True > set queue dos started = True > # > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = zarate-0 > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server next_job_number = 37 > > > > > see those 2 lines up there? those 2 dont show up as user... is this normal? > could this be messing with my configuration in a way taht Maui cant assign > the correct nodes on job submissions? this is normal. some parameters are not queryable by a non operator/manager in TORQUE From cholam20 at yahoo.co.in Thu Dec 8 13:04:24 2011 From: cholam20 at yahoo.co.in (revathi ganesh) Date: Fri, 9 Dec 2011 01:34:24 +0530 (IST) Subject: [torqueusers] Your question... Message-ID: <1323374664.78082.androidMobile@web137314.mail.in.yahoo.com>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111209/a2bc3a1b/attachment.html

From jakob.blomqvist at mah.se Mon Dec 12 17:16:10 2011
From: jakob.blomqvist at mah.se (Jakob Blomqvist)
Date: Tue, 13 Dec 2011 00:16:10 +0000
Subject: [torqueusers] Problem with password free scp on small cluster
Message-ID:

Dear all,

I would like to use torque to run parallel QC calculations on a mini-cluster consisting of three identical AMD 16-core machines.

These are the specs: The three machines are connected to each other and to the internet via Ethernet cable to a regular internet-connected switch, and all have static IPs provided by the IT guys. I have installed Ubuntu 10.04.3 Lucid LTS server edition on all three and I call them super1, super2, and super3, which I also set in /etc/hosts on all three machines. Only one machine is accessed by users, via ssh using RSA keys in order to log in without a password, and using a port different from 22.

I have set up NFS by creating a symbolic link of super2's /home in /export/home and then mounting super2's /export/home directory on /mnt/home on super1 and super3; however, super1 and super3 have their own user home directories (is this a mistake? necessary?).

The thing is: I would like to have torque simply be able to use all three machines in mpirun calculations, but as I try to set it up I have serious problems with error messages indicating e.g. an scp access problem (caused by the ssh configuration, I'm sure).

1. Is there a simple way to set my system up so that I have at least moderate security (and, if possible, can keep my password-free login, though that is not necessary) and still have no problem with the communication between the machines? I can, and have at the moment, set it up so that they can ssh between each other by creating key-pairs between super2 and super3 etc.
and then I can finally use torque without this error. But it honestly feels like there should be a simpler way for torque using rcp or hosts.equiv something. or? 2. And related to the previous question. As I try to run a submit-script I might have to copy a file (input file to software) to the dedicated master node. However I can't even use cp since I have set it up using ssh. How do I take advantage of my or a similar NFS if possible to get around this? I have tried to set various usecp variants in /var/spool/torque/mom_priv/config to no avail. I manually downloaded Torque 2.4.8 debian files from Ubuntu Maverick 10.10 Distro and installed Torque-common, client, Torque-Server, mom and sched on the Server machine (super2) and Torque-common, client and mom on the nodes (super1, super2) using dpgk -i. I added the /var/spool/torque/server_priv/nodes super2 np=16 super3 np=16 my qmgr: # # Create queues and set their attributes. # # # Create and define queue batch # create queue batch set queue batch queue_type = Execution set queue batch max_running = 32 set queue batch resources_max.ncpus = 32 set queue batch resources_max.nodes = 2 set queue batch resources_min.ncpus = 1 set queue batch resources_default.ncpus = 1 set queue batch resources_default.neednodes = 1:ppn=1 set queue batch resources_default.nodect = 1 set queue batch resources_default.nodes = 1 set queue batch resources_default.walltime = 01:00:00 set queue batch enabled = True set queue batch started = True # # Set server attributes. 
# set server scheduling = True set server acl_hosts = super2 set server managers = jakob at super2 set server operators = jakob at super2 set server default_queue = batch set server log_events = 511 set server mail_from = adm set server query_other_jobs = True set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server job_nanny = True set server mom_job_sync = True set server keep_completed = 60 set server next_job_number = 300020 ### Any suggestions would be appreciated. Best, Jakob Blomquist Associate Professor Dep. of Material Science IMP, School of Technology Malmo University SWEDEN +46(0)40 6657751 jakob.blomqvist at mah.se -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111213/80a13946/attachment.html From miguel.gila at cscs.ch Tue Dec 13 02:02:25 2011 From: miguel.gila at cscs.ch (Gila Arrondo Miguel Angel) Date: Tue, 13 Dec 2011 09:02:25 +0000 Subject: [torqueusers] Random SCP errors when transfering to/from CREAM sandbox In-Reply-To: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE@cscs.ch> Message-ID: Hi, we finally figured out what was going on. It turns out that the SCP errors were due to users canceling their jobs and CREAM-CE erasing the job's home before Torque did the stage-out phase. Since we are using scp and it logs to /var/log/messages, we saw tons of error messages (one for each file in the job). So in the end there was no problem :) Cheers, Miguel On 11/17/11 8:55 AM, "Gila Arrondo Miguel Angel" wrote: >Hi Chris, > >I've done that in many WNs and with different users, so I don't think >that is be the issue. I've also checked for scheduled tasks that interact >with the ssh keys, but the errors happen at random times, not when the >scheduled tasks run... :-S > >I'm running out of options here. 
> >Cheers, >Miguel > >On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: >> >>> Many thanks for your answer. We've made sure that the >>> keys are okay, as well as disabling hoskeychecking to >>> test it. >> >> Can you try and scp as that user to see whether it >> complains about anything else ? >> >> It may be that it is prompting the user to accept a >> host key if they don't already have it. >> >> cheers, >> Chris >> - -- >> Christopher Samuel - Senior Systems Administrator >> VLSCI - Victorian Life Sciences Computation Initiative >> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 >> http://www.vlsci.unimelb.edu.au/ >> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.11 (GNU/Linux) >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ >> >> iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW >> sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS >> =VPqK >> -----END PGP SIGNATURE----- >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > >-- >Miguel Gila >CSCS Swiss National Supercomputing Centre >HPC Solutions >Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland >miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22 > >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From ir_m_m_a_p at yahoo.com Tue Dec 13 10:45:26 2011 From: ir_m_m_a_p at yahoo.com (meysam miralipoor) Date: Tue, 13 Dec 2011 09:45:26 -0800 (PST) Subject: [torqueusers] (no subject) Message-ID: <1323798326.87144.yint-ygo-j2me@web160105.mail.bf1.yahoo.com> LOL! You will be satisfied after this!... 
From mrobbert at mines.edu Tue Dec 13 10:33:56 2011
From: mrobbert at mines.edu (Michael Robbert)
Date: Tue, 13 Dec 2011 10:33:56 -0700
Subject: [torqueusers] pbs_mom stuck in loop
Message-ID:

We have a cluster of about 64 nodes running Scyld Clusterware 5.6.3, which ships with Torque 2.5.6, and we are running Maui 3.3.1 as the scheduler on top of that. Approximately daily we see nodes showing up as down according to Torque, although they otherwise look normal. Sometimes we find that pbs_mom has crashed, but other times we find that it is still running and appears to be stuck in a loop.

Yesterday I had a node that was stuck in the loop and I was able to attach gdb to the process to confirm that. I found that it was stuck in the function scan_non_child_tasks inside mom_mach.c. I'm able to step forward to confirm that it keeps running a loop in that function. I'm not a programmer, but it looks to me like the linked list that it is attempting to traverse has the same address for both the next and previous tasks no matter how many times we go through the loop.
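One possible reading of the gdb output that follows, offered as a guess rather than a diagnosis: the ll_prior and ll_next values of ti_jobtask (0xa5e5108) sit 8 bytes past the task address itself (0xa5e5100), and ll_struct points back at the task. That would be consistent with a link that points at itself, i.e. a task that is no longer attached to any list head. The Python model below is not TORQUE code; the class and function names are made up for illustration, and it assumes a circular list whose head carries no owner, so that a walk ends on reaching the head.

```python
class Link:
    """Doubly linked list link; owner is None for the list head (assumed model)."""
    def __init__(self, owner=None):
        self.prior = self   # a fresh or detached link points at itself
        self.next = self
        self.owner = owner


class Task:
    def __init__(self, name):
        self.name = name
        self.link = Link(owner=self)


def append(head, task):
    """Insert task's link just before the head, i.e. at the tail of the list."""
    link = task.link
    link.prior = head.prior
    link.next = head
    head.prior.next = link
    head.prior = link


def walk(start_link, max_steps=10):
    """Follow next pointers; return (names visited, True if the walk terminated)."""
    names = []
    task = start_link.next.owner
    steps = 0
    while task is not None and steps < max_steps:
        names.append(task.name)
        task = task.link.next.owner   # reaching the head yields None and stops
        steps += 1
    return names, task is None


# Healthy list: head -> t1 -> t2 -> head
head = Link()
append(head, Task("t1"))
append(head, Task("t2"))
print(walk(head))                # (['t1', 't2'], True)

# Detached task whose link points at itself: the walk never reaches a head
stuck = Task("stuck")
print(walk(stuck.link, 5)[1])    # False
```

In this model the only thing that ends a traversal is hitting the head, so a loop that starts from, or wanders onto, a self-linked task would spin until something external stops it. The max_steps guard exists only so the example terminates.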
Here is some output to demonstrate:

3761 in mom_mach.c
(gdb) p *task
$2 = {ti_job = 0xa5e6130, ti_jobtask = {ll_prior = 0xa5e5108, ll_next = 0xa5e5108, ll_struct = 0xa5e5100}, ti_fd = -1, ti_flags = 0, ti_register = 0,
  ti_obits = {ll_prior = 0xa5e5130, ll_next = 0xa5e5130, ll_struct = 0x0},
  ti_info = {ll_prior = 0xa5e5148, ll_next = 0xa5e5148, ll_struct = 0x0},
  ti_qs = { ti_parentjobid = "199592.mio.mines.edu", '\000' , ti_parentnode = -1, ti_parenttask = 0, ti_task = 1, ti_status = 3, ti_sid = 86040, ti_exitstat = 0, ti_u = {ti_hold = { 0 }}}}
(gdb) n
3783 in mom_mach.c
(gdb) n
3761 in mom_mach.c
(gdb) n
3751 in mom_mach.c
(gdb) n
3761 in mom_mach.c
(gdb) p *task
$3 = {ti_job = 0xa5e6130, ti_jobtask = {ll_prior = 0xa5e5108, ll_next = 0xa5e5108, ll_struct = 0xa5e5100}, ti_fd = -1, ti_flags = 0, ti_register = 0,
  ti_obits = {ll_prior = 0xa5e5130, ll_next = 0xa5e5130, ll_struct = 0x0},
  ti_info = {ll_prior = 0xa5e5148, ll_next = 0xa5e5148, ll_struct = 0x0},
  ti_qs = { ti_parentjobid = "199592.mio.mines.edu", '\000' , ti_parentnode = -1, ti_parenttask = 0, ti_task = 1, ti_status = 3, ti_sid = 86040, ti_exitstat = 0, ti_u = {ti_hold = { 0 }}}}
(gdb) bt full
#0 scan_non_child_tasks () at mom_mach.c:3761
        dent =
        task = 0xa5e5100
        job = 0xa5e6130
        pdir = 0xa62bd10
        first_time = 0
#1 0x0000000000416fe9 in main_loop () at mom_main.c:8251
        myla = 2.4703282292062327e-323
        tmpTime =
        id = "main_loop"
#2 0x0000000000417221 in main (argc=5, argv=0x7fffa0d03718) at mom_main.c:8406
        rc = 0
        tmpFD =
(gdb)

Any thoughts on how we're getting here and better yet how to prevent it?

Thanks,
Mike Robbert

From cwebberops at gmail.com Wed Dec 14 09:48:04 2011
From: cwebberops at gmail.com (Christopher Webber)
Date: Wed, 14 Dec 2011 08:48:04 -0800
Subject: [torqueusers] Array IDs greater than 99999
Message-ID: <116B86BD-7177-49C2-BEF5-ABA9F848F7ED@gmail.com>

All,

I have a PI who is trying to use array IDs greater than 99999 and we are getting back a "Bad Job Array Request" error.
I have added the error below and am able to reproduce the problem myself.

12/14/2011 08:36:53;0080;PBS_Server;Req;req_reject;Reject reply code=15083(Bad Job Array Request), aux=0, type=Commit, from cwebber at biocluster

The manual notes that max_job_array_size is by default unlimited but in an attempt to troubleshoot I set it to 100000000 for testing with no change.

Any suggestions?

--
cwebber

From cholam20 at yahoo.co.in Wed Dec 14 11:18:19 2011
From: cholam20 at yahoo.co.in (revathi ganesh)
Date: Wed, 14 Dec 2011 23:48:19 +0530 (IST)
Subject: [torqueusers] try it out for yourself
Message-ID: <1323886699.73967.androidMobile@web137306.mail.in.yahoo.com>


-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111214/b6cdaf56/attachment.html From roman.ricardo at gmail.com Wed Dec 14 15:42:42 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 14 Dec 2011 16:42:42 -0600 Subject: [torqueusers] maui wont run jobs from 1 of 2 queues In-Reply-To: References: <7CBB1A19-DB96-4A0B-BF71-9572D9D193D9@ur.rochester.edu> <0C9EAF1B-6914-441E-885A-BE11103368AB@ldeo.columbia.edu> <711927CF-535C-4D31-9084-7B6348D307F8@ldeo.columbia.edu> <652371C9-4CCF-45E1-AD9A-AA8A3A6C423F@ldeo.columbia.edu> <64159374-27FD-466C-9C15-7F349896D8A3@ldeo.columbia.edu> Message-ID: im using a Fedora 11 PPC in the PS3. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111214/551376c0/attachment.html From thomas.zeiser at rrze.uni-erlangen.de Thu Dec 15 11:28:57 2011 From: thomas.zeiser at rrze.uni-erlangen.de (Thomas Zeiser) Date: Thu, 15 Dec 2011 19:28:57 +0100 Subject: [torqueusers] use of qchkpt Message-ID: <20111215182857.GA7875@rrze.uni-erlangen.de> Dear All, what is the use of "qchkpt"? If torque is configured correctly with BLCR support, "qchkpt" can be used to make a "snapshot" of a running job. However, is there any possibility to resume from that generated checkpoint? Best regards, Thomas Zeiser From listsarnau at gmail.com Thu Dec 15 04:28:48 2011 From: listsarnau at gmail.com (Arnau Bria) Date: Thu, 15 Dec 2011 12:28:48 +0100 Subject: [torqueusers] limiting resource usage with torque Message-ID: <20111215122848.6eab11c0@amarrosa.pic.es> Hi all, We were testing how to limit and request resource usage with torque. 
The Torque doc, and some docs I found on the net, say that defining resources_max at queue level is enough for limiting resource usage:

* page 62 of torque doc v 3.0.0
  resources_max  Specifies the maximum resource limits for jobs submitted to the queue

So, we did something like:

resources_max.vmem=6gb

Also, after configuring 'size [fs=/home]' on all nodes, we added a default resource request (free disk space) at submitfilter level:

line="#PBS -l file=30gb -c n"

from the man page:

-l resource_list
  Defines the resources that are required by the job and establishes a limit to the amount of resource that can be consumed.

Jobs were submitted with:

Resource_List.file = 30gb
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.pvmem = 6000mb

which seemed to work fine, but after some jobs started running, we noticed that nodes were not running all the jobs they were supposed to, despite being in the free state. I.e., a node with 24gb of mem (PHYS+SWAP) using only 12gb of mem did not run more than 4 jobs when 8 was its limit.

So, if it had free resources, why was it not running more jobs?

After some debugging we found the source. MAUI was reserving 6gb of mem for each job, so 4 jobs * 6gb of mem = 24gb. All the mem was reserved for those 4 jobs and the node was not selected for running more.

from checknode:
[...]
Configured Resources: PROCS: 8 MEM: 15G SWAP: 23G DISK: 122G
Utilized Resources: SWAP: 5048M DISK: 35G
Dedicated Resources: PROCS: 4 SWAP: 23G DISK: 30G
[...]

And we suppose that something similar would happen with the DISK resource if more jobs started (yep, we have some nodes with low disk space).

So, did we understand the resources_max parameter and the -l qsub option correctly? Why does maui make that resource reservation?

Maybe this question should go to the maui list, but to avoid double-posting (for now): can we avoid maui's reservation of resources? How are other admins limiting VMEM usage per job? How can we request that some disk space be available?
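The arithmetic behind the checknode numbers above can be written out as a small sketch. It assumes the scheduler treats each job's memory request as dedicated and stops placing jobs on a node when either the job slots or the memory run out; the function name and the simple min() model are illustrative assumptions, not Maui's actual logic.

```python
def max_jobs_on_node(job_slots, node_mem_gb, per_job_mem_gb):
    """Jobs that fit when both slots and dedicated memory are limits (assumed model)."""
    if per_job_mem_gb <= 0:
        # No memory request: only the slot count limits placement
        return job_slots
    return min(job_slots, node_mem_gb // per_job_mem_gb)


# 8 slots, 24gb (PHYS+SWAP), 6gb dedicated per job: memory runs out first
print(max_jobs_on_node(8, 24, 6))   # 4

# The same node with a 3gb per-job request would fill all 8 slots
print(max_jobs_on_node(8, 24, 3))   # 8
```

Under this model the per-job request, not the actual usage (12gb in the example above), decides how many jobs land on the node, which would match the 4-job ceiling described.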
Many thanks in advance, and especially to those who read this far ;-)

Cheers,
Arnau

From rajatphull at gmail.com Fri Dec 16 09:43:16 2011
From: rajatphull at gmail.com (rajat phull)
Date: Fri, 16 Dec 2011 11:43:16 -0500
Subject: [torqueusers] GPU Sharing in Torque-3.0.3
Message-ID:

Hi All,

I am trying to submit a bunch of GPU based jobs to Torque-3.0.3 (recent release). I have enabled shared mode for GPU in all the job-submission scripts. The way I have enabled sharing of GPU is as follows:

For Job1: #PBS -l nodes=2:gpus=1:shared (Similarly for all other jobs)

My cluster is comprised of 4 Nodes with a single GPU attached to each node. On submitting all the jobs simultaneously, with each job requiring 2 Nodes and a single GPU on each node, I am observing that all the jobs are made to run on first two nodes in the cluster. The other two nodes in the cluster are unused.

*How can I enable all the nodes to be used instead of just first 2 nodes for this case? I don't want to explicitly specify the node name in my job submission scripts.*

Thanks in Advance,
Rajat

From scrusan at ur.rochester.edu Fri Dec 16 11:48:41 2011
From: scrusan at ur.rochester.edu (Steve Crusan)
Date: Fri, 16 Dec 2011 13:48:41 -0500
Subject: [torqueusers] TORQUE - Email stopped working
Message-ID: <1A48478F-3AD3-47E6-A1B1-0E345A4C42E1@ur.rochester.edu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

We recently upgraded our cluster to TORQUE 2.5.9. For some reason job email does not work anymore. I can easily send email from the node via the mail/sendmail commands on all the nodes. Torque was compiled like this, and made into rpms.

You don't have to read all of that, but sendmail's configuration option below is /usr/sbin/sendmail, and it does exist on the system.
$ strings pbs_mom | grep mail '--host=x86_64-redhat-linux-gnu' '--build=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux' '--program-prefix=' '--prefix=/opt/torque/2.5.9' '--exec-prefix=/opt/torque/2.5.9/x86_64' '--bindir=/opt/torque/2.5.9/x86_64/bin' '--sbindir=/opt/torque/2.5.9/x86_64/sbin' '--sysconfdir=/etc' '--datadir=/opt/torque/2.5.9/share' '--libdir=/opt/torque/2.5.9/x86_64/lib' '--libexecdir=/opt/torque/2.5.9/x86_64/libexec' '--localstatedir=/var' '--sharedstatedir=/opt/torque/2.5.9/com' '--mandir=/opt/torque/2.5.9/man' '--infodir=/usr/share/info' '--includedir=/opt/torque/2.5.9/include' '--with-default-server=bhsn-int' '--with-server-home=/var/spool/torque' '--with-sendmail=/usr/sbin/sendmail' '--disable-dependency-tracking' '--disable-gui' '--without-tcl' '--with-rcp=scp' '--enable-syslog' '--disable-gcc-warnings' '--disable-munge-auth' '--with-pam=/lib64/security' '--disable-drmaa' '--enable-high-availability' '--disable-qsub-keep-override' '--disable-blcr' '--disable-cpuset' '--enable-spool' '--with-nvidia-gpus' '--disable-spool' '--enable-docs' '--disable-rpp' '--disable-munge' 'CFLAGS=-O2 -g -m64 -mtune=generic' 'CXXFLAGS=-O2 -g -m64 -mtune=generic' 'FFLAGS=-O2 -g -m64 -mtune=generic' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'target_alias=x86_64-redhat-linux' Even if I start a job via qsub with -m abe, no mail is sent. I've set the pbs_mom loglevels to 10, but I do not see any errors about sending mail. Does anyone know of an obvious place to start debugging this issue? Obviously postfix is running on all the nodes, and there isn't a mail queue, nor are there any mail messages being dropped. The one caveat in our system is that our headnode is our relayhost for all of the nodes to send mail. The problem is, there aren't any messages in the relayhost's mail queue, nor are there any in the maillog from TORQUE. 
When I send mail from a node via the commandline, it works without an issue, and there are entries in the relayhost's maillog.

Am I missing something here?

----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJO65KPAAoJENS19LGOpgqKXbAH/1VWF6/DjP3+AeT/sS9rf44K
0J7ivG9oGPZ6LIUonyK7k69ZVSx8S1vRHXGv56mIM3ImaGY555dbvGOpsTlW0J3a
XI9Moljp61efxuFoNu27Ix/nKlmcnVfyxQSOpjJLOuVXhy3Mdhe9qKMq8KyUHQWk
CQT4VwSSmdoo9h8sMm3P9S9ktUmbP+2XLvXnT/pXZ9vG8J/lwIxt/wwCduyXlcaE
pHXH55qKzUZmxhOFIpv40B1I303pYY/8KoRaqTM+yDqLImltKwjWM/5xQDVsgiVc
gu0PYiPAYtVKyPllrddSqmU395MpB/mBNS5MjFmNWnO2LWgm/ukk82ZDG8OwleQ=
=e4tg
-----END PGP SIGNATURE-----

From mej at lbl.gov Fri Dec 16 13:15:53 2011
From: mej at lbl.gov (Michael Jennings)
Date: Fri, 16 Dec 2011 12:15:53 -0800
Subject: [torqueusers] TORQUE - Email stopped working
In-Reply-To: <1A48478F-3AD3-47E6-A1B1-0E345A4C42E1@ur.rochester.edu>
References: <1A48478F-3AD3-47E6-A1B1-0E345A4C42E1@ur.rochester.edu>
Message-ID: <20111216201550.GE3348@lbl.gov>

On Friday, 16 December 2011, at 13:48:41 (-0500), Steve Crusan wrote:

> We recently upgrade our cluster to TORQUE 2.5.9. For some reason job
> email does not work anymore. I can easily send email from the node
> via the mail/sendmail commands on all the nodes. Torque was compiled
> like this, and made into rpms.
>
> You don't have to read all of that, but sendmail's configuration
> option below is /usr/sbin/sendmail, and it does exist on the system.
>
> Even if I start a job via qsub with -m abe, no mail is sent. I've set
> the pbs_mom loglevels to 10, but I do not see any errors about
> sending mail. Does anyone know of an obvious place to start
> debugging this issue?
>
> Obviously postfix is running on all the nodes, and there isn't a
> mail queue, nor are there any mail messages being dropped.
> > The one caveat in our system is that our headnode is our relayhost > for all of the nodes to send mail. The problem is, there aren't any > messages in the relayhost's mail queue, nor are there any in the > maillog from TORQUE. When I send mail from a node via the > commandline, it works without an issue, and there are entries in the > relayhost's maillog. We have a very similar setup, and awhile back mail stopped working for me too. I patched TORQUE to add more logging of mail attempts and why they succeeded, failed, or were not done at all. You might've seen messages like this one if your loglevel is 3 or higher: 12/13/2011 23:53:11;000d;PBS_Server;Job;1234.host;Not sending email: User does not want mail of this type. This comes from src/server/svr_mail.c in the svr_mailowner() routine. All messages have either "mail" or "popen" in them, so try grepping for those with a loglevel of 5. See if anything pops up. Michael -- Michael Jennings Linux Systems and Cluster Engineer High-Performance Computing Services Bldg 50B-3209E W: 510-495-2687 MS 050C-3396 F: 510-486-8615 From dbeer at adaptivecomputing.com Fri Dec 16 13:16:43 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 16 Dec 2011 13:16:43 -0700 (MST) Subject: [torqueusers] TORQUE - Email stopped working In-Reply-To: <1A48478F-3AD3-47E6-A1B1-0E345A4C42E1@ur.rochester.edu> Message-ID: <74e98a1b-cb31-4219-a505-7d29b4afe3c3@mail> ----- Original Message ----- > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi all, > > We recently upgrade our cluster to TORQUE 2.5.9. For some reason job > email does not work anymore. I can easily send email from the node > via the mail/sendmail commands on all the nodes. Torque was > compiled like this, and made into rpms. > > You don't have to read all of that, but sendmail's configuration > option below is /usr/sbin/sendmail, and it does exist on the system. 
> > $ strings pbs_mom | grep mail > '--host=x86_64-redhat-linux-gnu' '--build=x86_64-redhat-linux-gnu' > '--target=x86_64-redhat-linux' '--program-prefix=' > '--prefix=/opt/torque/2.5.9' > '--exec-prefix=/opt/torque/2.5.9/x86_64' > '--bindir=/opt/torque/2.5.9/x86_64/bin' > '--sbindir=/opt/torque/2.5.9/x86_64/sbin' '--sysconfdir=/etc' > '--datadir=/opt/torque/2.5.9/share' > '--libdir=/opt/torque/2.5.9/x86_64/lib' > '--libexecdir=/opt/torque/2.5.9/x86_64/libexec' > '--localstatedir=/var' '--sharedstatedir=/opt/torque/2.5.9/com' > '--mandir=/opt/torque/2.5.9/man' '--infodir=/usr/share/info' > '--includedir=/opt/torque/2.5.9/include' > '--with-default-server=bhsn-int' > '--with-server-home=/var/spool/torque' > '--with-sendmail=/usr/sbin/sendmail' '--disable-dependency-tracking' > '--disable-gui' '--without-tcl' '--with-rcp=scp' '--enable-syslog' > '--disable-gcc-warnings' '--disable-munge-auth' > '--with-pam=/lib64/security' '--disable-drmaa' > '--enable-high-availability' '--disable-qsub-keep-override' > '--disable-blcr' '--disable-cpuset' '--enable-s > pool' '--with-nvidia-gpus' '--disable-spool' '--enable-docs' > '--disable-rpp' '--disable-munge' 'CFLAGS=-O2 -g -m64 > -mtune=generic' 'CXXFLAGS=-O2 -g -m64 -mtune=generic' 'FFLAGS=-O2 > -g -m64 -mtune=generic' 'build_alias=x86_64-redhat-linux-gnu' > 'host_alias=x86_64-redhat-linux-gnu' > 'target_alias=x86_64-redhat-linux' > > > Even if I start a job via qsub with -m abe, no mail is sent. I've > set the pbs_mom loglevels to 10, but I do not see any errors about > sending mail. Does anyone know of an obvious place to start > debugging this issue? > > Obviously postfix is running on all the nodes, and there isn't a > mail queue, nor are there any mail messages being dropped. > > The one caveat in our system is that our headnode is our relayhost > for all of the nodes to send mail. The problem is, there aren't any > messages in the relayhost's mail queue, nor are there any in the > maillog from TORQUE. 
When I send mail from a node via the > commandline, it works without an issue, and there are entries in > the relayhost's maillog. > > Am I missing something here? > > > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ I don't see anything obvious that you're missing, but if you look at the logs, are there any messages about failures when sending emails? -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From scrusan at ur.rochester.edu Fri Dec 16 13:29:48 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Fri, 16 Dec 2011 15:29:48 -0500 Subject: [torqueusers] TORQUE - Email stopped working In-Reply-To: <74e98a1b-cb31-4219-a505-7d29b4afe3c3@mail> References: <74e98a1b-cb31-4219-a505-7d29b4afe3c3@mail> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Dec 16, 2011, at 3:16 PM, David Beer wrote: >> > > I don't see anything obvious that you're missing, but if you look at the logs, are there any messages about failures when sending emails? Ah, I got it. Bad symlink on our headnode. I was looking in the wrong place for the errors... *Bangs head off keyboard* Glad it's a Friday. Thanks guys! 
> > > -- > David Beer > Direct Line: 801-717-3386 | Fax: 801-717-3738 > Adaptive Computing > 1712 S East Bay Blvd, Suite 300 > Provo, UT 84606 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJO66pIAAoJENS19LGOpgqKSEcH+wRMRkSaPHtul6jBMPr48SSP SYtKsVmD4fAU7qxktuTEo9JkaHvjm1cDYpBPlV1WlxLyM2OwU72tE5xERIKCdNND KZ5lHUGy6NV81pmiVsS6Y6quQPgn9Ibol4Fy8NXEqef4B0Xqy2wI+vp3Wulkmri0 B6gQPba76odb59dMVvLYJzQt4c8zfp3KShF3gR+LRzQ0aWjbAwanVTwRGjBYwSoH wQlX8Y49xzFpqismSrXi3bu0UvB7RO5HpKazpmLm3T+zpWko6xdLU/A0DoQSUvtC B86FKt4gn+S7ja3QPm6xvVxZrybtmvhmx6zRLwnF9M0mBChFwfzzCDuwDOWhbv8= =6mN7 -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Sun Dec 18 23:11:05 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Mon, 19 Dec 2011 17:11:05 +1100 Subject: [torqueusers] Array IDs greater than 99999 In-Reply-To: <116B86BD-7177-49C2-BEF5-ABA9F848F7ED@gmail.com> References: <116B86BD-7177-49C2-BEF5-ABA9F848F7ED@gmail.com> Message-ID: <4EEED579.5040408@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 15/12/11 03:48, Christopher Webber wrote: > I have a PI who is trying to use array ID's greater than > 99999 and we are getting back a "Bad Job Array Request" error. Looking at the source PBS_MAXJOBARRAYLEN is set to 6 in Torque 2.4 and 2.5 and 7 in 3.0. That defines the maximum number of characters that can represent the array part (including the dash), so 99999 would indeed be the largest for Torque 2.4 and 2.5. Not sure what the implications would be of recompiling with a larger value of PBS_MAXJOBARRAYLEN. 
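The arithmetic behind that limit can be sketched in a few lines of shell. This is only an illustration of the reading given above (the length counts the dash, so the digits get one character less); the function name max_array_id is hypothetical and is not part of TORQUE, and the 3.0 figure is simply what the same reading would imply for a length of 7.

```shell
#!/bin/sh
# Illustration only: largest array index that fits if the array part may be
# at most "maxlen" characters long and one of those characters is the dash.
# max_array_id is a made-up helper name, not a TORQUE command.
max_array_id() {
    digits=$(( $1 - 1 ))        # one character is used by the dash
    id=""
    i=0
    while [ "$i" -lt "$digits" ]; do
        id="${id}9"             # build 9, 99, 999, ...
        i=$(( i + 1 ))
    done
    echo "$id"
}

max_array_id 6    # PBS_MAXJOBARRAYLEN as reported for TORQUE 2.4 and 2.5
max_array_id 7    # PBS_MAXJOBARRAYLEN as reported for TORQUE 3.0
```

Under that reading the two calls print 99999 and 999999, which matches the "Bad Job Array Request" seen for IDs above 99999 on 2.4/2.5.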
I would strongly suggest you open a bug about this here:

http://www.clusterresources.com/bugzilla/

cheers!
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7u1XgACgkQO2KABBYQAh/b0wCdGlH4ciSVC3MtW55qTR+0WthG
5gwAoI9TF6qI3Evb15PnjsF1ArEesc6q
=eDzl
-----END PGP SIGNATURE-----

From glen.beane at gmail.com Mon Dec 19 05:29:23 2011
From: glen.beane at gmail.com (Glen Beane)
Date: Mon, 19 Dec 2011 07:29:23 -0500
Subject: [torqueusers] Array IDs greater than 99999
In-Reply-To: <4EEED579.5040408@unimelb.edu.au>
References: <116B86BD-7177-49C2-BEF5-ABA9F848F7ED@gmail.com> <4EEED579.5040408@unimelb.edu.au>
Message-ID:

On Mon, Dec 19, 2011 at 1:11 AM, Christopher Samuel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 15/12/11 03:48, Christopher Webber wrote:
>
>> I have a PI who is trying to use array ID's greater than
>> 99999 and we are getting back a "Bad Job Array Request" error.
>
> Looking at the source PBS_MAXJOBARRAYLEN is set to 6 in Torque
> 2.4 and 2.5 and 7 in 3.0.  That defines the maximum number of
> characters that can represent the array part (including the dash),
> so 99999 would indeed be the largest for Torque 2.4 and 2.5.
>
> Not sure what the implications would be of recompiling with a
> larger value of PBS_MAXJOBARRAYLEN.
Since the first portion of a .JB file is a binary dump of the job struct, changing this value would make that particular instance of TORQUE unable to read .JB files from a previous version of TORQUE without some extra "job upgrade" code. (There have been changes that increased fixed size arrays in this struct before, and TORQUE is able to "upgrade" the struct from a previous version to the current one, but it requires some extra code in src/server/job_qs_upgrade.c.)

> I would strongly suggest you open a bug about this here:
>
> http://www.clusterresources.com/bugzilla/
>
> cheers!
> Chris
> - --
>    Christopher Samuel - Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
>         http://www.vlsci.unimelb.edu.au/
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk7u1XgACgkQO2KABBYQAh/b0wCdGlH4ciSVC3MtW55qTR+0WthG
> 5gwAoI9TF6qI3Evb15PnjsF1ArEesc6q
> =eDzl
> -----END PGP SIGNATURE-----
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From j.kasiak at gmail.com Mon Dec 19 06:52:27 2011
From: j.kasiak at gmail.com (Jan Kasiak)
Date: Mon, 19 Dec 2011 08:52:27 -0500
Subject: [torqueusers] GPU Sharing in Torque-3.0.3
In-Reply-To:
References:
Message-ID:

Hi Rajat,

You might want to try a maui option first.
Based on this:
Example 3: Pack tasks onto loaded nodes first.
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF=JOBCOUNT

I would try negating that by:
NODECFG[DEFAULT] PRIORITYF='-JOBCOUNT'

-Jan

On Fri, Dec 16, 2011 at 11:43 AM, rajat phull wrote:
> Hi All,
>
> I am trying to submit a bunch of GPU based jobs to Torque-3.0.3 (recent
> release). I have enabled shared mode for GPU in all the job-submission
> scripts.
> The way I have enabled sharing of GPU is as follows:
> For Job1: #PBS -l nodes=2:gpus=1:shared   (Similarly for all other jobs)
>
> My cluster is comprised of 4 Nodes with a single GPU attached to each node.
> On submitting all the jobs simultaneously, with each job requiring 2 Nodes
> and a single GPU on each node, I am observing that all the jobs are made to
> run on first two nodes in the cluster. The other two nodes in the cluster
> are unused. How can I enable all the nodes to be used instead of just first
> 2 nodes for this case? I don't want to explicitly specify the node name in
> my job submission scripts.
>
> Thanks in Advance,
> Rajat
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From d.yilmaz at rug.nl Sun Dec 18 05:46:45 2011
From: d.yilmaz at rug.nl (D.Yilmaz)
Date: Sun, 18 Dec 2011 13:46:45 +0100
Subject: [torqueusers] Questions about Torque
In-Reply-To: <77208c6277fd.4eede078@rug.nl>
References: <7730fcaa79bb.4eeddf49@rug.nl> <76e0fa551719.4eeddf86@rug.nl> <7770912a1cc0.4eeddfc2@rug.nl> <7750b6e45328.4eeddfff@rug.nl> <7720c2243c22.4eede03b@rug.nl> <77208c6277fd.4eede078@rug.nl>
Message-ID: <76e083a25cfc.4eedeec5@rug.nl>

Dear All,

I am very new to Torque and I would like to install the torque system on a single system only.
I have a couple of questions:

1. Can I install Master Node, Submit Node and Compute node on a single system?
2. Will having all three processes in a single system cause any problems?
3. Do you have any comments or recommendations for such a system?

Thanks a lot in advance & B.Regards,
Dogan YILMAZ

From torque4.mailinglist at gmail.com Mon Dec 19 01:55:25 2011
From: torque4.mailinglist at gmail.com (Torque4 User)
Date: Mon, 19 Dec 2011 16:55:25 +0800
Subject: [torqueusers] Torque 4.0 Installation Problem
Message-ID:

I downloaded Torque 4.0 beta via
svn co svn://clusterresources.com/torque/trunk

but encountered a problem with torque.setup. serverdb cannot be created and qterm is not able to terminate pbs_server.
Any help is appreciated.
Thanks

# ./torque.setup root
initializing TORQUE (admin: root at torque4a)

You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?y
Error with socket_connect - -2-cannot connect to port 4 in socket_connect_addr - connection refused
Communication failure.
qmgr: cannot connect to server (errno=15095) Error getting connection to socket
ERROR: cannot set TORQUE admins
Error with socket_connect - -2-cannot connect to port 4 in socket_connect_addr - connection refused
Communication failure.
qterm: could not connect to server '' (15095) Error getting connection to socket

# ps -ef | grep pbs
root 28951 1 0 16:55 ?
00:00:00 pbs_server -t create
root 28969 2066 0 16:57 pts/0 00:00:00 grep pbs

# ls -l /var/spool/torque/server_priv
total 52
drwxr-xr-x 2 root root 4096 Dec 19 14:33 accounting
drwxr-x--- 2 root root 4096 Dec 19 2011 acl_groups
drwxr-x--- 2 root root 4096 Dec 19 2011 acl_hosts
drwxr-x--- 2 root root 4096 Dec 19 2011 acl_svr
drwxr-x--- 2 root root 4096 Dec 19 2011 acl_users
drwxr-x--- 2 root root 4096 Dec 19 2011 arrays
drwxr-x--- 2 root root 4096 Dec 19 2011 credentials
drwxr-x--- 2 root root 4096 Dec 19 2011 disallowed_types
drwxr-x--- 2 root root 4096 Dec 19 2011 hostlist
drwxr-x--- 2 root root 4096 Dec 19 2011 jobs
-rw-r--r-- 1 root root 367 Dec 19 14:36 nodes
drwxr-x--- 2 root root 4096 Dec 19 2011 queues
-rw------- 1 root root 6 Dec 19 16:55 server.lock
-rw------- 1 root root 0 Dec 19 14:33 tracking

# qterm
Error with socket_connect - -2-cannot connect to port 4 in socket_connect_addr - connection refused
Communication failure.
qterm: could not connect to server '' (15095) Error getting connection to socket

From knielson at adaptivecomputing.com Mon Dec 19 08:33:49 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 19 Dec 2011 08:33:49 -0700 (MST)
Subject: [torqueusers] Torque 4.0 Installation Problem
In-Reply-To:
Message-ID: <30a95afd-5652-4130-a585-517c21e219b7@mail>

Don't forget to start trqauthd. This is the new authorization daemon that takes the place of pbs_iff. This is something TORQUE users will need to get used to starting in 4.0.
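A minimal sketch of the start-up order this implies, assuming trqauthd and qmgr were installed somewhere on root's PATH (the install location is not given in the thread) and that a pbs_server is already running, as in the ps output of the report. The leftover "pbs_server -t create" process from the failed torque.setup run may need to be stopped before torque.setup is run again.

```shell
#!/bin/sh
# Sketch only: start the authorization daemon before any client command
# (qmgr, qterm, qstat) tries to reach pbs_server.
# Assumes trqauthd and qmgr are on PATH; run as root.
if command -v trqauthd >/dev/null 2>&1; then
    trqauthd                    # must be up before clients can authenticate
    qmgr -c 'print server'      # quick check that the server is now reachable
else
    echo "trqauthd not found in PATH; check the TORQUE install prefix" >&2
fi
```

If the qmgr check succeeds, qterm and a fresh torque.setup run should be able to connect as well.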
Ken ----- Original Message ----- > From: "Torque4 User" > To: torqueusers at supercluster.org > Sent: Monday, December 19, 2011 1:55:25 AM > Subject: [torqueusers] Torque 4.0 Installation Problem > > > I downloaded Torque 4.0 beta via > svn co svn:// clusterresources.com/torque/trunk > > but encountered the problem with torque.setup. serverdb is not able > to create and qterm is not able to terminate pbs_server. > Any help is appreciated. > Thanks > > > # ./torque.setup root > initializing TORQUE (admin: root at torque4a) > > You have selected to start pbs_server in create mode. > If the server database exists it will be overwritten. > do you wish to continue y/(n)?y > Error with socket_connect - -2-cannot connect to port 4 in > socket_connect_addr - connection refused > Communication failure. > qmgr: cannot connect to server (errno=15095) Error getting connection > to socket > ERROR: cannot set TORQUE admins > Error with socket_connect - -2-cannot connect to port 4 in > socket_connect_addr - connection refused > Communication failure. > qterm: could not connect to server '' (15095) Error getting > connection to socket > > > # ps -ef | grep pbs > root 28951 1 0 16:55 ? 
00:00:00 pbs_server -t create > root 28969 2066 0 16:57 pts/0 00:00:00 grep pbs > > > # ls -l /var/spool/torque/server_priv > total 52 > drwxr-xr-x 2 root root 4096 Dec 19 14:33 accounting > drwxr-x--- 2 root root 4096 Dec 19 2011 acl_groups > drwxr-x--- 2 root root 4096 Dec 19 2011 acl_hosts > drwxr-x--- 2 root root 4096 Dec 19 2011 acl_svr > drwxr-x--- 2 root root 4096 Dec 19 2011 acl_users > drwxr-x--- 2 root root 4096 Dec 19 2011 arrays > drwxr-x--- 2 root root 4096 Dec 19 2011 credentials > drwxr-x--- 2 root root 4096 Dec 19 2011 disallowed_types > drwxr-x--- 2 root root 4096 Dec 19 2011 hostlist > drwxr-x--- 2 root root 4096 Dec 19 2011 jobs > -rw-r--r-- 1 root root 367 Dec 19 14:36 nodes > drwxr-x--- 2 root root 4096 Dec 19 2011 queues > -rw------- 1 root root 6 Dec 19 16:55 server.lock > -rw------- 1 root root 0 Dec 19 14:33 tracking > > > # qterm > Error with socket_connect - -2-cannot connect to port 4 in > socket_connect_addr - connection refused > Communication failure. > qterm: could not connect to server '' (15095) Error getting > connection to socket > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From gus at ldeo.columbia.edu Mon Dec 19 13:10:35 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Mon, 19 Dec 2011 15:10:35 -0500 Subject: [torqueusers] Questions about Torque In-Reply-To: <76e083a25cfc.4eedeec5@rug.nl> References: <7730fcaa79bb.4eeddf49@rug.nl> <76e0fa551719.4eeddf86@rug.nl> <7770912a1cc0.4eeddfc2@rug.nl> <7750b6e45328.4eeddfff@rug.nl> <7720c2243c22.4eede03b@rug.nl> <77208c6277fd.4eede078@rug.nl> <76e083a25cfc.4eedeec5@rug.nl> Message-ID: <4612D61C-8690-4A25-B04A-00C3670CEE28@ldeo.columbia.edu> Hi Dogan I did this in a couple of standalone machines. It works. There is no conflict in having the pbs_server, pbs_sched and pbs_mom running in a standalone machine. 
You can also install maui, if you need a more sophisticated scheduler, and turn off pbs_sched, but you may not need that much in a single machine.

You could install Torque [and Maui] from source, which is probably the best, because you can choose configuration options, but takes some work.

Fedora used to have Torque RPMs, which are easier to install, but lack configuration choices. Maybe it still has them [I haven't checked lately], and maybe other Linux distributions [e.g. Ubuntu] have packages as well. I don't know if there are Maui RPMs/packages.

If you install from packages, make sure you get all three components, perhaps also the gui if you want the xpbs and xpbsmon GUIs. One of the packages may be complete and provide everything, but you need to check.

I will probably forget something, but anyway, the suggestions I have for Torque installation on a single machine are:

1) Write a ${TORQUE}/server_priv/nodes file with

localhost np=8

where I am guessing your computer has 8 CPU/cores, but you should adjust. The file may be already there.

2) Likewise, ${TORQUE}/server_name should have only

localhost

3) Make sure your /etc/hosts file has the loopback interface line set correctly:

127.0.0.1 localhost.localdomain localhost

4) Use chkconfig [or equivalent] to schedule the pbs_server, pbs_sched [or maui], and pbs_mom services to start when the computer boots up.

5) Set up a basic queue, or several, if you prefer.

6) To configure and use Torque, it is worth reading the Torque Administrator Guide:
http://www.adaptivecomputing.com/resources/docs/

If you install from packages, most likely the ${TORQUE} directory is /var/torque, but you need to check this out too.

I hope this helps,
Gus Correa

On Dec 18, 2011, at 7:46 AM, D.Yilmaz wrote:

> Dear All,
>
> I am very new at Torque and I would like to install the torque system in to only a single system.
> I have a couple of questions:
>
> 1. Can I install Master Node, Submit Node and Compute node in to a single system?
> 2.
Will having all three processes in a single system cause any problems? > 3. Do you have any comments or recommendations for such a system? > > Thanks a lot in advance & B.Regards, > Dogan YILMAZ _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From torsten at synapse.sri.com Mon Dec 19 13:23:50 2011 From: torsten at synapse.sri.com (Torsten Rohlfing) Date: Mon, 19 Dec 2011 12:23:50 -0800 Subject: [torqueusers] Questions about Torque In-Reply-To: <4612D61C-8690-4A25-B04A-00C3670CEE28@ldeo.columbia.edu> References: <4612D61C-8690-4A25-B04A-00C3670CEE28@ldeo.columbia.edu> Message-ID: <4EEF9D56.4060406@synapse.sri.com> I second this. Running Torque server, scheduler, mom, and submission on a single host is no problem. Have done this for 8 years with ever increasing Torque versions, and Maui on the same host, and never seen any trouble. To add to this - Fedora does still have Torque RPMs, and they're quite recent (3.0.1 on F15), but none for Maui. Best, Torsten -- Torsten Rohlfing, PhD SRI International, Neuroscience Program Senior Research Scientist 333 Ravenswood Ave, Menlo Park, CA 94025 Phone: ++1 (650) 859-3379 Fax: ++1 (650) 859-2743 torsten at synapse.sri.com http://www.stanford.edu/~rohlfing/ "Though this be madness, yet there is a method in't" From rajatphull at gmail.com Mon Dec 19 16:00:26 2011 From: rajatphull at gmail.com (rajat phull) Date: Mon, 19 Dec 2011 18:00:26 -0500 Subject: [torqueusers] GPU sharing using Torque Message-ID: Hi Jan, Thanks for your response.I tried configuring Maui on my cluster, but it doesn't support GPU based scheduling (with gpus=Num_GPU keyword in the job script). Any other suggestions to make this working. Is there a way in torque such that we can achieve sharing of GPU with all the nodes being utilized instead of just first two nodes of the cluster? 
(Problem Context mentioned Below) Rajat > Hi Rajat, > > You might want to try a maui option first. > Based on this: > Example 3: Pack tasks onto loaded nodes first. > NODEALLOCATIONPOLICY PRIORITY > NODECFG[DEFAULT] PRIORITYF=JOBCOUNT > > I would try negating that by: > NODECFG[DEFAULT] PRIORITYF='-JOBCOUNT' > > -Jan > > On Fri, Dec 16, 2011 at 11:43 AM, rajat phull > wrote: > > Hi All, > > > > I am trying to submit a bunch of GPU based jobs to Torque-3.0.3 (recent > > release). I have enabled shared mode for GPU in all the job-submission > > scripts. The way I have enabled sharing of GPU is as follows: > > For Job1: #PBS -l nodes=2:gpus=1:shared ? (Similarly for all other jobs) > > > > My cluster is comprised of 4 Nodes with a single GPU attached to each > node. > > On submitting all the jobs simultaneously, with each job requiring 2 > Nodes > > and a single GPU on each node, I am observing that all the jobs are made > to > > run on first two nodes in the cluster. The other two nodes in the cluster > > are unused. How can I enable all the nodes to be used instead of just > first > > 2 nodes for this case? I don't want to explicitly specify the node name > in > > my job submission scripts. > > > > Thanks in Advance, > > Rajat > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111219/27f542e7/attachment.html From samuel at unimelb.edu.au Mon Dec 19 16:35:06 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 20 Dec 2011 10:35:06 +1100 Subject: [torqueusers] GPU sharing using Torque In-Reply-To: References: Message-ID: <4EEFCA2A.8040507@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 20/12/11 10:00, rajat phull wrote: > Thanks for your response.I tried configuring Maui on my cluster, but it > doesn't support GPU based scheduling (with gpus=Num_GPU keyword in the > job script). Any other suggestions to make this working. If Maui doesn't support it then I suspect it's something that's a feature only of Moab, the commercial scheduler. :-( cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk7vyioACgkQO2KABBYQAh+chwCfTWEx0DGLUaRQqgM0QfdIUhNT SmsAoIUn4YNiIgO0Rz51MOKqP1ajCPf1 =QvrE -----END PGP SIGNATURE----- From tbaer at utk.edu Tue Dec 20 08:59:56 2011 From: tbaer at utk.edu (Troy Baer) Date: Tue, 20 Dec 2011 10:59:56 -0500 Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting In-Reply-To: References: Message-ID: <1324396796.2542.568.camel@browncoat.jics.utk.edu> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote: > I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs keep randomly segfaulting and dying. 
I see this in the MOM log right before dying: > > 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit > > > And something similar to this in dmesg: > > pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4 We've also seen this on one of our systems and had to fall back to 2.5.8 on it. --Troy -- Troy Baer, HPC System Administrator National Institute for Computational Sciences, University of Tennessee http://www.nics.tennessee.edu/ Phone: 865-241-4233 From knielson at adaptivecomputing.com Tue Dec 20 14:03:17 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 20 Dec 2011 14:03:17 -0700 (MST) Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting In-Reply-To: <1324396796.2542.568.camel@browncoat.jics.utk.edu> Message-ID: ----- Original Message ----- > From: "Troy Baer" > To: "Torque Users Mailing List" > Sent: Tuesday, December 20, 2011 8:59:56 AM > Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting > > On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote: > > I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then, MOMs > > keep randomly segfaulting and dying. I see this in the MOM log > > right before dying: > > > > 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file > > descriptor (9) in tm_request, comm failed Protocol failure in > > commit > > > > > > And something similar to this in dmesg: > > > > pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f > > rsp 00007fff19e96df0 error 4 > > We've also seen this on one of our systems and had to fall back to > 2.5.8 > on it. > > --Troy > -- > Troy Baer, HPC System Administrator > National Institute for Computational Sciences, University of > Tennessee > http://www.nics.tennessee.edu/ > Phone: 865-241-4233 Could someone configure TORQUE using --with-debug and then send a stack trace of the crash? 
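One way to produce such a trace, sketched under assumptions: the binary and core file locations below are placeholders that are not taken from this thread, and gdb is assumed to be installed on the node. Only --with-debug comes from the request itself.

```shell
#!/bin/sh
# Sketch only: MOM_BIN and CORE_FILE are placeholder paths; adjust both.
MOM_BIN=${MOM_BIN:-/usr/local/sbin/pbs_mom}    # assumed install location
CORE_FILE=${CORE_FILE:-./core}                 # assumed core file location

# 1. Rebuild with debugging symbols, in the TORQUE source tree:
#      ./configure --with-debug && make && make install
# 2. Allow core dumps in the shell that starts the daemon, then start it:
#      ulimit -c unlimited
#      pbs_mom
# 3. After the next segfault, print the stack trace from the core file:
if command -v gdb >/dev/null 2>&1 && [ -f "$CORE_FILE" ]; then
    gdb -batch -ex "bt" "$MOM_BIN" "$CORE_FILE"
else
    echo "need gdb and a core file at $CORE_FILE" >&2
fi
```

The backtrace printed in step 3 is the part worth posting to the list, together with the TORQUE version and the MOM log lines around the crash.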
Ken

From dbeer at adaptivecomputing.com Thu Dec 22 20:01:13 2011
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 22 Dec 2011 20:01:13 -0700 (MST)
Subject: [torqueusers] TORQUE 4.0 Is Officially Beta-Testing
In-Reply-To: <427a7948-aea7-4a08-8a5a-e50efd352a17@mail>
Message-ID: <6b5283c9-c64f-4029-a78b-62a7a80e7f12@mail>

All,

We are happy to announce that TORQUE 4.0 is officially transitioning to a beta stage of development. We would like to encourage all to download and install it on your test systems. Please remember that this software is in a beta stage, and that enormous changes have been made to improve TORQUE. Even though this beta has gone through a greatly improved QA process, we expect to see some hiccups as people begin to roll it out on their test systems.

You can download it here:
http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.26656snapstamp.tar.gz

Documentation is located here for html:
http://www.adaptivecomputing.com/resources/docs/torque/4-0/help.htm
and here for pdf:
http://www.adaptivecomputing.com/resources/docs/torque/4-0/torqueAdminGuide-4.0.pdf

We are hoping that the TORQUE community will be willing to assist us in making TORQUE 4.0 the best release of TORQUE to date. In order to effectively assist in this process, administrators would need to:

1. Configure with debugging symbols (--with-debug)
2. Ensure that core dumping is on for all server daemons (execute ulimit -c unlimited as the user that will run pbs_server or pbs_mom).
3. Be proactive in gathering as much information about any problems as possible, and share that information with the developers.

Access to the TORQUE 4.0 Beta is currently limited to the TORQUE community and support is expected to be largely community-driven until early January 2012.
If you are interested in participating in our official beta program kicking off in January 2012, please email David Gardner within Product Management (dgardner at adaptivecomputing.com). Participants in our invite-only beta program will receive increased support from Adaptive Computing engineering through February 2012. Space is limited, so let us know soon if you want to be considered.

Your feedback is extremely important to us and we look forward to hearing back from you either through the users list and/or as participants in the official Adaptive Computing beta program.

There are some known issues with the beta:

1. qstat will occasionally crash when a system is under a high load. To work around this, run qstat again. (It doesn't crash the server.)
2. Sometimes, when running parallel jobs rapidly, they get stuck in a running state. We have observed this very rarely.
3. autogen.sh is currently broken on older versions of aclocal and autoconf.
4. There is currently no error checking for inconsistencies in the nodes file and the mom_hierarchy file.
5. The server will segfault under very high load (servicing hundreds of requests per second for several minutes in a row while manually qrun'ing jobs).
6. qdel -p doesn't put running jobs in a completed state; it just deletes them.

Feel free to update us on known issues as they become known.

--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From bdandrus at nps.edu Tue Dec 27 13:16:17 2011
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Tue, 27 Dec 2011 20:16:17 +0000
Subject: [torqueusers] scp output files and nologin
Message-ID:

All,

Ok, wondering what the best practice is to deal with this:

I touch /etc/nologin on our submit nodes so I can drain folks off for maintenance.

BUT: when a job is completing and tries to scp the output file back to the submit node, it will fail because of the /etc/nologin existing.
Is there a good way to deal with this?

I suspect there should be a way to tell pbs_mom that certain mounts are shared, so she doesn't need to copy files, but that is only good if I am sure folks are putting their output on a shared filesystem.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From lloyd_brown at byu.edu Tue Dec 27 13:21:38 2011
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Tue, 27 Dec 2011 13:21:38 -0700
Subject: [torqueusers] scp output files and nologin
In-Reply-To:
References:
Message-ID: <4EFA28D2.4010402@byu.edu>

If you use the "$usecp" settings in your mom's mom_priv/config, then it gets copied back via cp, rather than scp. When you have a shared filesystem this should bypass the problem you're seeing.

However, if you don't have a shared filesystem, that's a much harder situation. In that case, in theory, the host where the user submitted the job may be the only place where the filesystem exists or is mounted, so it really is dependent on logging into that node.

You might be able to bypass this via some clever PAM trickery, e.g. if in maintenance mode then only allow logins for users who both have currently-running jobs, AND are coming from a hostname corresponding to your compute nodes. You might have to write a PAM module, though. I'm not aware of any that will do this.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

From jjc at iastate.edu Tue Dec 27 13:30:16 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Tue, 27 Dec 2011 20:30:16 +0000
Subject: [torqueusers] scp output files and nologin
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E221017F027B@ITSDAG1D.its.iastate.edu>

Brian,

For all filesystems that are shared filesystems (e.g. NFS mounts):

For example, suppose /home is mounted from /home on the node head. Then place:

    $usecp head:/home /home

in /var/spool/torque/mom_priv/config on each compute node. You will need to do this before you start pbs_mom, or you can just restart pbs_mom on the compute node. Then it uses just a cp command rather than an scp. This is faster anyway. You can place as many $usecp directives as you want.

Assuming you are at least at 2.5.4, you can also place

    $spool_as_final_name true

in that same file, and the STDOUT and STDERR files appear in their final place rather than being spooled and then copied. This has cut down on non-delivery problems due to quota issues.

One note on the last option, for when the same name is used for the STDOUT and STDERR files in subsequent runs in the same directory: With spooling, the old STDOUT and STDERR files were replaced.
With $spool_as_final_name true in effect, the files are appended to. This can cause confusion, especially when a job fails the first time, since then STDERR has the error message still in it after a subsequent successful run.

My users like the true setting here, since then they can monitor the job, and if they fill up their disk quota, at least they get part of the STDERR and STDOUT files.

- Jim C.

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

From scrusan at ur.rochester.edu Tue Dec 27 15:30:00 2011
From: scrusan at ur.rochester.edu (Steve Crusan)
Date: Tue, 27 Dec 2011 17:30:00 -0500
Subject: [torqueusers] scp output files and nologin
In-Reply-To: <4EFA28D2.4010402@byu.edu>
References: <4EFA28D2.4010402@byu.edu>
Message-ID: <51A89ED1-69A8-416D-8EF7-3FD4D67FD008@ur.rochester.edu>

On Dec 27, 2011, at 3:21 PM, Lloyd Brown wrote:

> If you use the "$usecp" settings in your mom's mom_priv/config, then it
> gets copied back via cp, rather than scp. When you have a shared
> filesystem this should bypass the problem you're seeing.
>
> However, if you don't have a shared filesystem, that's a much harder
> situation. In that case, in theory, the host where the user submitted
> the job may be the only place where the filesystem exists or is mounted,
> so it really is dependent on logging into that node.
>
> You might be able to bypass this via some clever PAM trickery, e.g. if in
> maintenance mode then only allow logins for users who both have
> currently-running jobs, AND are coming from a hostname corresponding to
> your compute nodes. You might have to write a PAM module, though. I'm
> not aware of any that will do this.

I believe the above can be accomplished by setting a standing reservation a few weeks before you plan to apply maintenance (usually longer than the longest allowed walltime). This will keep jobs from starting during the maintenance, and jobs that would still be running during the maintenance window won't be scheduled. Our policy basically is, if we are having a maintenance day, the specified system(s) will be down, so do not expect access.

Then I think you can use pam_torque to keep users from logging into nodes unless they have a job running on that node.
http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/3.4hostsecurity.php

This works well for us.

This depends on your userbase though. If it's a small set of users you have good communication with, and they understand the benefits of a queuing/scheduling system, you can get away w/o having host security access. If you have a much larger base of users, some of whom are used to ssh'ing to wherever they want and running code, things can turn into the wild wild west.

Just my .02

~Steve

----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/

From lloyd_brown at byu.edu Tue Dec 27 15:42:06 2011
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Tue, 27 Dec 2011 15:42:06 -0700
Subject: [torqueusers] scp output files and nologin
In-Reply-To: <51A89ED1-69A8-416D-8EF7-3FD4D67FD008@ur.rochester.edu>
References: <4EFA28D2.4010402@byu.edu> <51A89ED1-69A8-416D-8EF7-3FD4D67FD008@ur.rochester.edu>
Message-ID: <4EFA49BE.8090807@byu.edu>

Steve, et al.,

I agree. When we have a maintenance period, we usually use a reservation to quiesce the whole system, and so this issue doesn't occur. I meant to include that in my last email, but there were a few concerns:

- My email was already too long (as mine usually are)
- I didn't know whether Brian was using Moab (or possibly Maui; I don't know Maui's capabilities very well).
  For all I know, he was using some custom-built script that calls qrun. Not what I'd recommend, but theoretically possible.
- I wondered if Brian was talking about just taking that specific interactive/login/submission host offline, not the whole system.

But, assuming that it's a whole-system downtime, and using something like Moab or Maui, the reservation is *definitely* the way to go.

Now if I could just remember the right syntax for mrsvctl....

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

From jjc at iastate.edu Tue Dec 27 16:06:37 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Tue, 27 Dec 2011 23:06:37 +0000
Subject: [torqueusers] scp output files and nologin
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E221017F02DC@ITSDAG1D.its.iastate.edu>

Brian,

For maintenance concerning all the compute nodes, I use qstop. Usually this is more for scheduled power or A/C issues than anything else, as the head node and fileservers are on UPS, but the compute nodes are not.

I issue qstop for the queues at the correct times so that no new jobs can start on that queue unless they can be guaranteed to end before the maintenance period.

E.g. for a 24-hour queue named large, issue

    qstop large

24 1/2 or 25 hours before the maintenance window starts. Similarly, for a one-hour queue named short, issue

    qstop short

90 minutes before the start of the maintenance window. The extra 30 minutes are so that jobs have time to complete and I have time to power the nodes down.

I do this all in crontab so that this happens at the correct time. Submitted jobs can still queue up, and users can log in (assuming I'm not working on the head node) and edit files (assuming I'm not working on the filesystem).

For 6am maintenance windows by operations staff, I have even put the poweroff commands in crontab also, so that the nodes shut down even if I'm not here, or if I'm out of town. Operations staff can then power them on when maintenance is done.

If you only need to perform maintenance on a few nodes, use pbsnodes -o to offline them, then work on them after any jobs that are running on them complete.

Maintenance on the head node or fileservers is another matter. Then I just announce a time, and just shut down at the announced time. My users would rather be able to get on the system up to the last minute; /etc/nologin would take the system away for far too long.

- Jim

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

From samuel at unimelb.edu.au Tue Dec 27 21:40:08 2011
From: samuel at unimelb.edu.au (Chris Samuel)
Date: Wed, 28 Dec 2011 15:40:08 +1100
Subject: [torqueusers] scp output files and nologin
In-Reply-To: <4EFA49BE.8090807@byu.edu>
References: <51A89ED1-69A8-416D-8EF7-3FD4D67FD008@ur.rochester.edu> <4EFA49BE.8090807@byu.edu>
Message-ID: <201112281540.08611.samuel@unimelb.edu.au>

On Wed, 28 Dec 2011 09:42:06 AM Lloyd Brown wrote:

> Now if I could just remember the right syntax for mrsvctl....

We just use setres, it might be deprecated but at least it's usable.

--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

From bdandrus at nps.edu Wed Dec 28 16:14:36 2011
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Wed, 28 Dec 2011 23:14:36 +0000
Subject: [torqueusers] scp output files and nologin
In-Reply-To: <201112281540.08611.samuel@unimelb.edu.au>
References: <51A89ED1-69A8-416D-8EF7-3FD4D67FD008@ur.rochester.edu> <4EFA49BE.8090807@byu.edu> <201112281540.08611.samuel@unimelb.edu.au>
Message-ID:

All,

Thanks for the input. We have Moab here and I do use system reservations when I have the big maintenance going on. It was the corner case of "I need to prevent folks from logging onto the submit node for a bit" that occurs on occasion.

It seems that the optimal solution is to tell users to put their output on the shared filesystems. So the only grief is when I may need to manually copy files from some node's /tmp or such to the submit node. And when I find I must do that, it becomes an opportunity to educate that user.

Thanks again for the insight!

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From wytsang at clustertech.com Thu Dec 29 00:38:33 2011
From: wytsang at clustertech.com (Clotho Tsang)
Date: Thu, 29 Dec 2011 15:38:33 +0800
Subject: [torqueusers] qstat -Q shows negative queueing jobs
Message-ID:

When I issue the following command:

    qsub -t1-10 <<<'sleep 10; hostname'

and use "qstat -Q" to check the status, I find that there is a negative number of queued jobs:

    [clotho at m4 ~]$ qstat -Q
    Queue              Max   Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T
    ----------------   ---   ---   ---   ---   ---   ---   ---   ---   ---   --- -
    batch                0    10   yes   yes    -1     2     0     0     0     0 E

The problem is found in versions 2.5.6, 2.5.9 and 3.0.3. The platform is CentOS 5.6 x86_64.
I have seen another report of the problem on version 3.0.1, but there was no reply:
http://www.clusterresources.com/pipermail/torqueusers/2011-June/013040.html

--
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: wytsang at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com

From jonathan.michalon at etu.unistra.fr Tue Dec 20 06:30:04 2011
From: jonathan.michalon at etu.unistra.fr (Jonathan Michalon)
Date: Tue, 20 Dec 2011 14:30:04 +0100
Subject: [torqueusers] GPU sharing using Torque
In-Reply-To: <4EEFCA2A.8040507@unimelb.edu.au>
References: <4EEFCA2A.8040507@unimelb.edu.au>
Message-ID: <20111220143004.3bc9ff20@RunningPenguin.chalmion.homelinux.net>

On Tue, 20 Dec 2011 10:35:06 +1100, Christopher Samuel wrote:

> On 20/12/11 10:00, rajat phull wrote:
>
> > Thanks for your response. I tried configuring Maui on my cluster, but it
> > doesn't support GPU based scheduling (with gpus=Num_GPU keyword in the
> > job script). Any other suggestions to make this working.
>
> If Maui doesn't support it then I suspect it's something that's
> a feature only of Moab, the commercial scheduler. :-(

Yes it is. But as a lot of people are asking for this feature directly in Maui, the idea of implementing it in the open-source code was suggested to me, and I agreed to try to make things work. If I'm lucky enough you may see this in progress in the next six months.

By the way, if anyone reading this email has already started something or can just give hints on how to do it, feel free to contact me.

Regards,
--
Jonathan Michalon
IT student in Strasbourg

From sergey_bulk at list.ru Thu Dec 29 05:18:50 2011
From: sergey_bulk at list.ru (Sergey Bulk)
Date: Thu, 29 Dec 2011 16:18:50 +0400
Subject: [torqueusers] memory limit -l mem is not working
Message-ID:

I have torque 2.5.7-9.el6 from the EPEL repo on SL6.

When requesting resources with

    #PBS -l mem=400gb,nodes=node01:ppn=8

torque does not take the mem parameter into account.

So users can run 2 jobs requesting 800gb memory in total on a 500gb memory node.

How to address this issue?

Thank you,
SN

From jwbacon at tds.net Fri Dec 30 10:56:27 2011
From: jwbacon at tds.net (Jason Bacon)
Date: Fri, 30 Dec 2011 11:56:27 -0600
Subject: [torqueusers] memory limit -l mem is not working
In-Reply-To:
References:
Message-ID: <4EFDFB4B.9090507@tds.net>

Check to see if your operating system supports it. Most don't, so you may have to use vmem or pvmem instead.

-J

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W.
Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From jjc at iastate.edu Fri Dec 30 11:14:24 2011
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 30 Dec 2011 18:14:24 +0000
Subject: [torqueusers] memory limit -l mem is not working
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E221017F05AC@ITSDAG1D.its.iastate.edu>

Sergey,

There are two options:

1) For each queue, set reasonably low defaults for pmem and vmem, e.g. for nodes which have 512gb and 32 processor cores, set them to 512gb/32 = 16gb:

    qmgr -c 'set queue large resources_default.pmem = 16gb'
    qmgr -c 'set queue large resources_default.vmem = 16gb'

This will force users to specify pmem= and vmem= if they want more than this, otherwise they just get 16gb for both.

2) Write a submit filter which scans for mem= (and maybe vmem=, pmem= and nodes=N:ppn=M). Then you can alter the job submitted. E.g. on a 512GB node with 32 processors (i.e. 16gb per processor), the submit filter could calculate

    ceiling(mem/mem_per_processor) = ceiling(400gb/16gb) = 25

Then use that value to change the ppn= to max(8,25) in the job request. This just reserves as many processors as needed with each getting their share of the memory, unless they already have more processors reserved.

- Jim C.
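
The ppn arithmetic in option 2 above can be sketched in plain shell. This is only an illustration of the calculation, not a complete submit filter: the function names ceil_div and adjust_ppn are made up for this example, sizes are assumed to be whole gigabytes, and the parsing of the -l line from the job script (which a real filter would read on stdin) is left out.

```shell
#!/bin/sh
# Sketch of the ppn adjustment only; all names are illustrative and
# are not part of TORQUE.

# ceil_div A B: print ceiling(A / B) using integer arithmetic.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# adjust_ppn MEM_GB PPN MEM_PER_CORE_GB
# Print the larger of the requested ppn and the number of cores whose
# combined memory share covers the requested mem.
adjust_ppn() {
    needed=$(ceil_div "$1" "$3")
    if [ "$needed" -gt "$2" ]; then
        echo "$needed"
    else
        echo "$2"
    fi
}

# Numbers from the thread: mem=400gb, ppn=8, 16gb per core.
adjust_ppn 400 8 16    # prints 25
```

A job that already asks for more cores than its memory share requires is left alone, which matches the "unless they already have more processors reserved" case.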