From sm4082 at nyu.edu Thu Mar 1 07:58:41 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 1 Mar 2012 09:58:41 -0500
Subject: [torqueusers] array job and resources
In-Reply-To:
References:
Message-ID:

I don't think this is possible. In a job array, each job is treated as separate from the others. The only advantage of arrays is that you don't have to run qsub multiple times when you need to change very little in your script, for example changing the input file from in1 to in2.

What you saw is reasonable behavior, as it placed the jobs one after another. If you really want each job to run on a different node, the only way is to use a for loop. I am not sure this can be done with arrays.

for i in {1..12}; do qsub -l nodes=node${i} ; done

Since ppn is not declared, it is 1 by default. If you want node01 rather than node1, you can use an if block. Or maybe seq can add zeros with some flag. I'm not sure about it.

You can use the same script without modifying it, since the command line takes precedence. Otherwise you can simply include the bash one-liners in the script.

Sreedhar.
--
Sent from my phone. Please excuse my brevity and any typos.

On Feb 29, 2012, at 5:17, Sergey Bulk wrote:

> I have torque 2.5.7-9.el6 from epel repo on SL6.
> I have 24-core nodes node01-node12.
>
> When requesting resources for an array job
>
> #!/bin/bash
> #PBS -t 1-12
> #PBS -l nodes=node01:ppn=1+node02:ppn=1+node03:ppn=1+node04:ppn=1+....+node12:ppn=1
> #PBS -d .
>
> for f in `seq 1 1000`;
> do
> ps aux
> done;
>
> I would expect that each job in the array should occupy its own node,
> but, instead, -l option is for every job not for the whole array.
>
> So all jobs are running on the node01 because there is enough cores.
>
> What is the correct way to request resources for the whole array job
> rather than for the single job in the array?
>
> Thank you,
> SN
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From adaptivecomputing at bridgemailsystem.com Thu Mar 1 09:26:09 2012
From: adaptivecomputing at bridgemailsystem.com (Adaptive Computing)
Date: Thu, 1 Mar 2012 08:26:09 -0800 (PST)
Subject: [torqueusers] News About Moab Technology from Adaptive Computing
Message-ID: <9364766.1330619171944.JavaMail.root@mail2.bms.local>

An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120301/10b82aa4/attachment-0001.html

From sm4082 at nyu.edu Thu Mar 1 10:37:25 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 1 Mar 2012 12:37:25 -0500
Subject: [torqueusers] Step Change in Job Arrays
In-Reply-To:
References:
Message-ID:

Hello,

First, I'm sorry for sending multiple emails. After sending the last message I realized we can't put it in the qsub submit filter. If we put it in the qsub filter, qsub still complains about a bad array request. So I made another wrapper that works from the system side. We just have to create a wrapper that goes through the arguments, changes them, and resubmits them to qsub.

You can use the code from my previous email and put it in a file such as /usr/local/bin/qsub_wrapper.sh on the login node where users submit their jobs. Then add this line

/opt/torque/bin/qsub "$@"

to the end of the code. This line submits the job with the changed arguments. If you have a submit filter, it works as always.

Then we need to create an alias for our wrapper.
I have my alias as

[root at login-0-0 ~]# cat /etc/profile.d/qsub.sh
alias qsub='/usr/local/bin/qsub_wrapper.sh'
[root at login-0-0 ~]# ls -l $_
-rwxr-xr-x 1 root root 44 Mar 1 11:58 /etc/profile.d/qsub.sh
[root at login-0-0 ~]# cat /usr/local/bin/qsub_wrapper.sh
#!/bin/bash
args=("$@")
for((arg=0;arg<$#;arg++))
do
  if [ "${args[$arg]}" = "-t" ]
  then
    if echo "${args[$(($arg+1))]}" | egrep "^(^0|^[1-9][0-9]*)-[1-9][0-9]*:[1-9][0-9]*$" > /dev/null 2>&1
    then
      array_arg="${args[$(($arg+1))]}"
      array_arg="$(seq -s, $(echo $array_arg | cut -f1 -d-) $(echo $array_arg | cut -f2 -d:) $(echo $array_arg | cut -f1 -d: | cut -f2 -d-))"
      str="set -- \"\${@:1:$(($arg+1))}\" \"${array_arg}\" \"\${@:$(($arg+3))}\""
      eval `echo "$str"`
      break
    fi
  fi
done
/opt/torque/bin/qsub "$@"
[root at login-0-0 ~]#
[root at login-0-0 ~]# ls -l $_
-rwxr-xr-x 1 root root 528 Mar 1 12:12 /usr/local/bin/qsub_wrapper.sh
[root at login-0-0 ~]#

I have tested and it works. See below for an example

[user at login-0-0 array_jobs]$ qsub -t 0-15:3 array.pbs
157713[].crunch.local
[user at login-0-0 array_jobs]$ qstat -t -u sm4082

crunch.local:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
157713[0].crunch     sm4082   s48      arraytest-0        2790     1   1     -- 00:05 R    --
157713[3].crunch     sm4082   s48      arraytest-3        2804     1   1     -- 00:05 R    --
157713[6].crunch     sm4082   s48      arraytest-6        2825     1   1     -- 00:05 R    --
157713[9].crunch     sm4082   s48      arraytest-9        2849     1   1     -- 00:05 R    --
157713[12].crunc     sm4082   s48      arraytest-12       2865     1   1     -- 00:05 R    --
157713[15].crunc     sm4082   s48      arraytest-15      16818     1   1     -- 00:05 R    --
[user at login-0-0 array_jobs]$

Sreedhar.

On Wed, Feb 29, 2012 at 10:51 PM, Sreedhar Manchu wrote:

> Hello,
>
> Putting the code below in qsub wrapper would make torque recognize step
> change like 1-20:2 or 0-19:3, etc. This works only when -t is mentioned in
> the command line.
> I made my script recognize the same even when users
> mention it inside the script as well. But it's quite long. I guess once we
> find #PBS -t 2-23:3, we just have to make the script change it to #PBS -t
> 2,5,8,11,14,17,20,23. Which can be easily done with Andy's method of using
> seq.
>
> Anyway, here is the code for wrapper for making it work with command line.
>
> #!/bin/bash
> args=("$@")
> for((arg=0;arg<$#;arg++))
> do
>   if [ "${args[$arg]}" = "-t" ]
>   then
>     if echo "${args[$(($arg+1))]}" | egrep "^(^0|^[1-9][0-9]*)-[1-9][0-9]*:[1-9][0-9]*$" > /dev/null 2>&1
>     then
>       array_arg="${args[$(($arg+1))]}"
>       array_arg="$(seq -s, $(echo $array_arg | cut -f1 -d-) $(echo $array_arg | cut -f2 -d:) $(echo $array_arg | cut -f1 -d: | cut -f2 -d-))"
>       str="set -- \"\${@:1:$(($arg+1))}\" \"${array_arg}\" \"\${@:$(($arg+3))}\""
>       eval `echo "$str"`
>       break
>     fi
>   fi
> done
>
> In case you already have a wrapper in place, it doesn't do any harm to
> keep it at the beginning of it.
>
> Sreedhar.
>
>
> On Wed, Feb 8, 2012 at 9:47 AM, Ibad Kureshi U0850037 wrote:
>
>> Thanks Glen, Andy
>>
>> Andy: Nice!
>>
>> -Ibad
>>
>>
>> ________________________________________
>> From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] On Behalf Of Andrew Caird [acaird at umich.edu]
>> Sent: Wednesday, February 08, 2012 2:28 PM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] Step Change in Job Arrays
>>
>> On Wed, Feb 8, 2012 at 9:02 AM, Glen Beane <glen.beane at gmail.com> wrote:
>> On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037 wrote:
>> > Hello,
>> >
>> > I was wondering if someone could tell me how to adjust the step size in
>> > a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small
>> > cluster and our users want to submit arrays.
>> >
>> > On the SGE and the Moab/Torque based systems
>> >
>> > $ -t 1-20:2
>> >
>> > or
>> >
>> > #PBS -t 1-20:2
>> >
>> > respectively, gives them 10 jobs with even ID numbers.
>> >
>> > How can this be done with Torque? It throws out "qsub: Bad Job Array
>> > Request" error
>> >
>> > Have not been able to find much literature on this.
>> >
>> > Thanks
>>
>>
>> this is not currently supported, but it is a great feature request.
>>
>> unfortunately the only option would be to explicitly specify each array
>> ID:
>>
>> #PBS -t 2,4,6,8,10 ...20
>>
>> Or:
>>
>> qsub -t `seq -s, 2 2 20` pbsfile.txt
>>
>> in case you don't want to type all the numbers.
>>
>> --andy
>>
>>
>> ---
>> This transmission is confidential and may be legally privileged. If you
>> receive it in error, please notify us immediately by e-mail and remove it
>> from your system. If the content of this e-mail does not relate to the
>> business of the University of Huddersfield, then we do not endorse it and
>> will accept no liability.
>>
>
>
> --
> Sreedhar Manchu
> HPC Support Specialist
> New York University
> 251 Mercer Street
> New York, NY 10012-1110
>

--
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120301/0dfeea83/attachment.html

From sergey_bulk at list.ru Fri Mar 2 01:31:48 2012
From: sergey_bulk at list.ru (Sergey Bulk)
Date: Fri, 02 Mar 2012 12:31:48 +0400
Subject: [torqueusers] array job and resources
In-Reply-To:
References:
Message-ID:

Hi, Sreedhar!

Thank you for your response.
It was important for me to understand that it is not
the bug of an array job but its feature.
So it exists just for grouping jobs, not for resource allocation.

I've solved my current problem using properties of nodes.
qsub -l nodes=bigmem mytask.sh - is sending to the big memory node
qsub anothertask.sh - is sending to any other node

Good luck!
Sergey

From cholam20 at yahoo.co.in Fri Mar 2 03:17:33 2012
From: cholam20 at yahoo.co.in (revathi ganesh)
Date: Fri, 2 Mar 2012 18:17:33 +0800 (SGT)
Subject: [torqueusers] no more strict deadlines
Message-ID: <1330683453.75109.androidMobile@web192204.mail.sg3.yahoo.com>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120302/e9ba4fcc/attachment-0001.html

From sm4082 at nyu.edu Fri Mar 2 07:48:51 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 2 Mar 2012 09:48:51 -0500
Subject: [torqueusers] array job and resources
In-Reply-To:
References:
Message-ID: <630D3B01-5E47-4373-AF6C-9094D54A3508@nyu.edu>

Hi Sergey,

Glad to hear that you found a solution. I also use node properties heavily to place the right jobs on the right nodes. It works great, especially with Moab. This feature comes in handy when we have nodes with different memory sizes in a cluster.

Regards,
Sreedhar.

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110

From akshar.bhosale at gmail.com Sat Mar 3 04:44:55 2012
From: akshar.bhosale at gmail.com (akshar bhosale)
Date: Sat, 3 Mar 2012 17:14:55 +0530
Subject: [torqueusers] nodes difference for job
Message-ID:

Hi,

We have Torque 2.5.8 and Maui 2.3.6 configured on a CentOS 5.2 cluster. We have two Maui partitions of nodes, parA and parB. checkjob says that the job should go to one of the nodes from parA, say node5. checknode node5 says that it is waiting for a job. showstart says that the job should start now on node5, but unfortunately the job remains in the idle state even though the node is available.

Also, checkjob -vvv does not show any of the nodes from parA, and the job is rejected for all the nodes from parB. It should have shown at least some nodes from parA. This happens only with one particular type of job; all the other jobs don't have this problem. The script is as follows:

#PBS -N myjob
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -e error2.txt
#PBS -o output2.txt
#PBS -r n
#!/bin/bash
##PBS -V
echo PBS JOB id is $PBS_JOBID
echo PBS_NODEFILE is $PBS_NODEFILE
echo PBS_QUEUE is $PBS_QUEUE
cat $PBS_NODEFILE
cd $PBS_O_WORKDIR
/home/raman/NAMD_2.8_Linux-x86_64/charmrun +p8 /home/raman/NAMD_2.8_Linux-x86_64 namd2 npt18.inp > npt18.out
##############

-akshar

From samuel at unimelb.edu.au Sun Mar 4 16:14:42 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 05 Mar 2012 10:14:42 +1100
Subject: [torqueusers] reducing energy usage of torque
In-Reply-To: <4F4C85A0.3080004@gmail.com>
References: <4F4C85A0.3080004@gmail.com>
Message-ID: <4F53F762.4010301@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 28/02/12 18:43, Daniel Fernando Coimbra wrote:

> I assume that by "turning off" you mean actually power down the
> node. I am just curious on how do you intend to power it up again
> later.

It should be fairly easy to integrate this with something like xCAT
to do power off and on of nodes via IPMI.

Moab can already do this, and of course you probably don't want to
power a node off immediately as you'll pay a cost for booting it back
up again so you really only want to power it down after it's been idle
for a while (as it's likely to keep on being idle for longer).
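
The idle-time rule in the last paragraph could be sketched in shell roughly as follows. This is only an illustration under assumptions: the function name should_power_off, the epoch-second inputs and the one-hour threshold used in the usage line are made up here, and the actual power-off step (xCAT, IPMI or Moab) is deliberately left out.

```shell
#!/bin/sh
# Hypothetical helper, not part of Torque, Moab or xCAT.
# Succeeds (exit status 0) when a node has been idle for at least
# the given number of seconds.
#   $1 = epoch seconds at which the node became idle (assumed to be tracked elsewhere)
#   $2 = current time in epoch seconds
#   $3 = minimum idle time in seconds before a power-off is considered
should_power_off() {
    [ $(( $2 - $1 )) -ge "$3" ]
}
```

Something like `should_power_off "$idle_since" "$(date +%s)" 3600 && echo "idle long enough"` would then gate whatever command really switches the node off.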

cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk9T92IACgkQO2KABBYQAh/v7wCfVBI057YK+Td5T3OeLkmHD86A
NbMAnikP3gdqqDrrSXQNmP/3+nFoEOL9
=9SEU
-----END PGP SIGNATURE-----

From siegert at sfu.ca Sun Mar 4 21:34:08 2012
From: siegert at sfu.ca (Martin Siegert)
Date: Sun, 4 Mar 2012 20:34:08 -0800
Subject: [torqueusers] how to (re)start mom without killing jobs?
Message-ID: <20120305043408.GA6494@stikine.sfu.ca>

Hi,

once in a while a mom daemon dies on one of our nodes (I haven't figured out the reason for the crash, but that is not really what my question is after). Thus, I end up having a bunch of jobs running on the node, but the node won't be used for new jobs until I restart the mom. How do I do that without killing the running processes?

We used to be able to do this by using the -p argument for the mom, but apparently this is not working anymore: every time I start the mom using "pbs_mom -p" all running jobs get killed. My feeling is that -p stopped working when we started to use cpusets (I am not absolutely sure about this since we also upgraded torque versions since then).

I find the following in the mom_log:

03/04/2012 20:18:07;0002; pbs_mom;Svr;Log;Log opened
03/04/2012 20:18:07;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.10, loglevel = 0
03/04/2012 20:18:07;0002; pbs_mom;Svr;initialize_root_cpuset;Init TORQUE cpuset /dev/cpuset/torque.
03/04/2012 20:18:08;0002; pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/15' deleted.
03/04/2012 20:18:09;0002; pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/14' deleted.
03/04/2012 20:18:10;0002; pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/13' deleted.
03/04/2012 20:18:11;0002; pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/12' deleted.
03/04/2012 20:18:12;0002; pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/11' deleted.
03/04/2012 20:18:13;0002; pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/10' deleted.
03/04/2012 20:18:14;0002; pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/9' deleted.
03/04/2012 20:18:15;0002; pbs_mom;Svr;cpuset_delete;Unused cpuset '/dev/cpuset/torque/6556.dev/8' deleted.
03/04/2012 20:18:16;0002; pbs_mom;Svr;remove_defunct_cpusets;Unused cpuset '/dev/cpuset/torque/6556.dev' deleted.
03/04/2012 20:18:16;0002; pbs_mom;Svr;setpbsserver;172.18.0.40
03/04/2012 20:18:16;0002; pbs_mom;Svr;mom_server_add;server 172.18.0.40 added
03/04/2012 20:18:16;0002; pbs_mom;Svr;setpbsserver;172.18.0.40
03/04/2012 20:18:16;0002; pbs_mom;Svr;mom_server_add;server host 172.18.0.40 already added
03/04/2012 20:18:16;0002; pbs_mom;Svr;setpbsserver;localhost
03/04/2012 20:18:16;0002; pbs_mom;Svr;mom_server_add;server localhost added
03/04/2012 20:18:16;0002; pbs_mom;Svr;restricted;172.18.0.40
03/04/2012 20:18:16;0002; pbs_mom;Svr;usecp;*:/home/ /home/
03/04/2012 20:18:16;0002; pbs_mom;Svr;usecp;*:/global/scratch/ /global/scratch/
03/04/2012 20:18:16;0002; pbs_mom;Svr;setignvmem;0
03/04/2012 20:18:16;0002; pbs_mom;Svr;ignmem;1
03/04/2012 20:18:16;0002; pbs_mom;Svr;settmpdir;/scratch
03/04/2012 20:18:16;0080; pbs_mom;n/a;add_static;config[11] add name size value [fs=/scratch]
03/04/2012 20:18:16;0002; pbs_mom;n/a;initialize;independent
03/04/2012 20:18:16;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
03/04/2012 20:18:16;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in task_recov, open of task file
03/04/2012 20:18:16;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in task_recov, open of task file
03/04/2012 20:18:16;0002; pbs_mom;Svr;pbs_mom;Is up
03/04/2012 20:18:16;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/local/torque-2.5.10.dbg/sbin/pbs_mom 1330377127
03/04/2012 20:18:16;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.10, loglevel = 0
03/04/2012 20:18:16;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server 172.18.0.40
03/04/2012 20:18:16;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost
03/04/2012 20:18:17;0008; pbs_mom;Job;scan_non_child_tasks;found exited session 19901 for task 3 in job 6536.dev
03/04/2012 20:18:17;0008; pbs_mom;Job;scan_non_child_tasks;found exited session 24272 for task 2 in job 6556.dev
03/04/2012 20:18:18;0002; pbs_mom;Svr;im_eof;End of File from addr 172.18.0.40:15001
03/04/2012 20:18:18;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server 172.18.0.40
03/04/2012 20:21:24;0002; pbs_mom;Svr;im_eof;Premature end of message from addr 127.0.0.1:15001
03/04/2012 20:21:25;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server localhost

Thus, it appears that the mom first removes all cpusets in /dev/cpuset/torque before querying the server whether there still is a corresponding job supposed to be running.

Anyway, can somebody tell me how to start the mom without killing jobs?

Thanks!!
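
To check that observation against a log file, one could count the cpuset deletions that are logged before the first hello is sent to a server. The following is a minimal sketch, assuming the semicolon-separated mom_log layout of the excerpt above; the function name is made up for illustration and is not part of Torque.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): count the cpuset deletions
# that pbs_mom logs before its first "sending hello to server" line.
# Assumed field layout, taken from the log excerpt:
#   timestamp;code;daemon;class;function;message
#   $1 = path to a mom_log file
count_cpusets_deleted_before_hello() {
    awk -F';' '
        # Stop reading at the first hello sent to a server.
        $5 == "mom_server_check_connection" && $6 ~ /^sending hello/ { exit }
        # Count the per-CPU cpusets and the per-job parent cpuset.
        ($5 == "cpuset_delete" || $5 == "remove_defunct_cpusets") && $6 ~ /deleted/ { n++ }
        # Print the total, or 0 if nothing matched.
        END { print n + 0 }
    ' "$1"
}
```

Calling `count_cpusets_deleted_before_hello /path/to/mom_log` (the path is a placeholder) prints a single number; anything greater than zero would be consistent with the cpusets being removed before the server is contacted.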

Cheers,
Martin

--
Martin Siegert
Simon Fraser University
Burnaby, British Columbia

From rf at q-leap.de Mon Mar 5 11:06:33 2012
From: rf at q-leap.de (rf at q-leap.de)
Date: Mon, 5 Mar 2012 19:06:33 +0100
Subject: [torqueusers] [Patch] GPUs by the way of GRES
In-Reply-To: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
References: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
Message-ID: <20309.169.908792.976630@gargle.gargle.HOWL>

>>>>> "Jonathan" == Jonathan Michalon writes:

Hi Jonathan,

while your patch adds some functionality to count allocated GPUs as a GRES, it lacks the important functionality to tell the job which GPUs are available to it. If the latest torque 2.5.x is built with GPU support, you have the option to specify a nodes spec like "-l nodes=1:gpus=1", and within the running job you can check $PBS_GPUFILE to see which GPUs you have been allocated.

Now the problem is that a job with a "-l nodes=1:gpus=1" specification won't be started by maui even if it has your patch. On the other hand, using your "-W x=GRES:gpu@1" spec (without a "-l nodes=1:gpus=1" spec) makes the job run, but it has no idea which GPU to use.

Is there an easy way to extend your patch, so that maui will make a job run with the "-l nodes=1:gpus=1" spec?

Cheers,

Roland

    Jonathan> Hi Maui folks,
    Jonathan> GPUs in Maui are a long standing problem. Last year a patch
    Jonathan> was sent by Mariusz Mamoński [1], which works based on GRES
    Jonathan> parameters. I've just made GPUs kind of working, by enhancing
    Jonathan> that patch. Please find attached the resulting patch, which
    Jonathan> works well for Maui 3.3.1. It defines a special GRES named
    Jonathan> "gpu" which works as expected on my test cases.

    Jonathan> Note that GRES behaviour seems quite confused as sometimes
    Jonathan> they are mentioned as consumable. This patch annihilates
    Jonathan> this behaviour, for the needs of GPUs.

    Jonathan> To use the patch: get the sources of maui-3.3.1 and patch
    Jonathan> them:
    Jonathan> patch -p1 < ../Patch-for-gpu-GRES.patch
    Jonathan> then compile as usual.

    Jonathan> You have to configure the GPUs in maui.cfg:
    Jonathan> NODECFG[nodename] GRES=gpu:2

    Jonathan> Then when queuing jobs you can request GPUs with (Torque
    Jonathan> syntax):
    Jonathan> qsub -W x=GRES:gpu@1

    Jonathan> I hope this helps, please test this and enhance to your
    Jonathan> needs!

    Jonathan> [1]
    Jonathan> http://www.supercluster.org/pipermail/mauiusers/2011-April/004622.html

    Jonathan> Regards,

    Jonathan> PS. This is the second attempt to send the mail?

    Jonathan> --
    Jonathan> Jonathan Michalon
    Jonathan> IT student in Strasbourg

From samuel at unimelb.edu.au Mon Mar 5 20:21:01 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 06 Mar 2012 14:21:01 +1100
Subject: [torqueusers] how to (re)start mom without killing jobs?
In-Reply-To: <20120305043408.GA6494@stikine.sfu.ca>
References: <20120305043408.GA6494@stikine.sfu.ca>
Message-ID: <4F55829D.2040705@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/03/12 15:34, Martin Siegert wrote:

> We used to be able to do this by using the -p argument for the
> mom, but apparently this is not working anymore: everytime I start
> the mom using "pbs_mom -p" all running jobs get killed.

That would be a really nasty regression, can you put in a bugzilla
entry for it please to flag it up to the developers.

http://www.clusterresources.com/bugzilla/

We're using cpusets on 2.4.16 and it seems to work OK..

cheers,
Chris
- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk9Vgp0ACgkQO2KABBYQAh+xTACfZr5SygPTorUqX167svITonFa
lLAAoJQFzkqfSYUWPhYEkXLfxlfzqzZc
=9PNa
-----END PGP SIGNATURE-----

From siegert at sfu.ca Mon Mar 5 21:00:35 2012
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 5 Mar 2012 20:00:35 -0800
Subject: [torqueusers] vmem and pvmem
In-Reply-To: <4F482228.8080208@anu.edu.au>
References: <20120215231031.GA956@stikine.sfu.ca> <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au> <4F476439.10609@fi.muni.cz> <20120224220009.GC29630@stikine.sfu.ca> <4F482228.8080208@anu.edu.au>
Message-ID: <20120306040035.GA28825@stikine.sfu.ca>

On Sat, Feb 25, 2012 at 10:50:00AM +1100, David Singleton wrote:
> On 02/25/2012 09:00 AM, Martin Siegert wrote:
> > On Fri, Feb 24, 2012 at 11:19:37AM +0100, "Mgr. Šimon Tóth" wrote:
> >>> Core_req        vmem  pvmem  ulimit-v  RPT
> >>> =========================================
> >>> nodes=1:ppn=2   1gb   256mb  256mb     512mb
> >>> procs=2         1gb   256mb  256mb     1gb
> >>> nodes=1:ppn=2   1gb   4gb    1gb       4gb
> >>> procs=2         1gb   4gb    1gb       4gb
> >>> nodes=1:ppn=2   1gb   -      1gb       512mb
> >>> procs=2         1gb   -      1gb       1gb
> >>>
> >>> So the ulimit value that influences whether a task can allocate
> >>> memory, is set as the lower of the vmem and pvmem values. That
> >>> makes some sense - at least more sense than taking the larger
> >>> value. What doesn't make sense is allowing pvmem to be higher
> >>> than vmem in the first place - in that case torque should probably
> >>> reject the job or 'fix' one of the settings but leaving it as is
> >>> might not be so bad, except for moab's behaviour (keep reading).
> >>
> >> No. The logic is as follows:
> >>
> >> * if pvmem (or pmem) is set
> >>   then set the corresponding ulimit to pvmem (pmem) value
> >>
> >> * if pvmem (or pmem) isn't set
> >>   then set the corresponding ulimit to vmem (mem) value
> >>
> >> Note that using pvmem is mostly pointless. On Linux this represents
> >> address space, not virtual memory.
> >>
> >> You can use vmem as virtual memory, but even that is extremely confusing.
> >
> > I do not understand this comment. Both pvmem and vmem requests will
> > result in RLIMIT_AS getting set.
>
> I disagree with vmem setting RLIMIT_AS if that is what is happening.
>
> > When I submit a MPI job using, e.g., procs=N, why is requesting
> > pvmem=X mostly pointless? Shouldn't it be totally equivalent to
> > requesting vmem=X*N ?
> >
>
> I think we have had the discussion of what procs means on a number of
> occasions (look for the thread "processes vs processors"). I believe "procs"
> (now) means (virtual) processORs (most commonly, they are cores). They are not
> processes. [In OpenPBS they were processes and only the UNICOS MOM supported
> that limit. At least in torque-3.0.2 procs is still not properly documented
> in pbs_resources* man pages.]
>
> pvmem sets some sort of memory limit per *process* so vmem should have nothing
> to do with procs and pvmem. pvmem and vmem are pretty much orthogonal. One is
> a voluntary limit the user places on their job processes (useless for actual
> resource scheduling) and the other is something any well-configured system
> should require a user to specify so that the resources of the system can be
> managed. In particular a job with only a pvmem limit can OOM any size node
> simply by spawning enough processes.
>
> Setting both independently (should a user choose to do so) seems perfectly
> sensible. But I agree with Gareth that it only makes sense to request
> vmem. Now what vmem actually is and how is should be evaluated and limited is
> a whole other discussion ...
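
The ulimit-v column of the table quoted above is consistent with "use the lower of vmem and pvmem, or whichever one is set". A minimal shell sketch of that rule follows; the function name and the choice of megabytes as the unit are assumptions made for this illustration, and this is not code taken from torque.

```shell
#!/bin/sh
# Hypothetical helper (not torque code): pick the address-space limit
# from a vmem and a pvmem request, both given in megabytes.
# An empty string means "not requested".
#   $1 = vmem in MB, $2 = pvmem in MB
effective_as_limit_mb() {
    vmem=$1
    pvmem=$2
    if [ -z "$pvmem" ]; then
        # Only vmem was requested (or neither).
        echo "$vmem"
    elif [ -z "$vmem" ]; then
        # Only pvmem was requested.
        echo "$pvmem"
    elif [ "$pvmem" -lt "$vmem" ]; then
        # Both requested: the lower value wins.
        echo "$pvmem"
    else
        echo "$vmem"
    fi
}
```

With the table's requests expressed in megabytes, `effective_as_limit_mb 1024 256` corresponds to the first two rows, `effective_as_limit_mb 1024 4096` to the middle two, and `effective_as_limit_mb 1024 ""` to the last two.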
You are right - my misconception was that pvmem would mean "address space per assigned processOR" and thus directly correspond to the procs request. I guess the problem is threefold: a) my perception of what torque does with pvmem and vmem requests; b) what torque actually does with pvmem and vmem requests; c) what torque should do with pvmem and vmem requests. I spent some time to figure out problem (b). I am not sure whether all of the following is right, thus please correct me ... In torque memory resources (vmem, pvmem, mem, pmem) are controlled in two ways: 1) initially by setting an appropriate rlimit before the job is started; 2) through a control mechanism that has the mom(s) periodically poll the job to determine the usage and terminate the job, if it is over the limit. I'll concentrate on vmem/pvmem (virtual memory/address space) for now. A. pvmem -------- 1) When a job requests pvmem=X bytes, the mom sets RLIMIT_AS to X before starting the job (resmom/linux/mom_mach.c, mom_set_limits). If both vmem and pvmem are requested RLIMIT_AS is set to the lesser of the two values. (if pmem is specified as well, the limit is set to the pmem value). 2) While the job is running the mom(s) check periodically whether there are processes that belong to the job that use more memory than X (resmom/linux/mom_mach.c, mom_over_limit, overmem_proc). This appears to be absolutely pointless since such processes cannot exist because of the rlimit set in (1). I.e. overmem_proc always returns false and and can be eliminated. Since RLIMIT_AS sets a per process limit nothing stops a program with a pvmem request to spawn more an more processes and potentially run a node out of memory. I am assuming that a scheduler (e.g. moab) reserves X bytes of address space for each assigned processor (not process!). Thus, there exists a discrepancy between what torque is controlling (address space per process) and what the scheduler has reserved for the job (address space per assigned core). B. 
vmem ------- 1) When a job requests vmem=Y bytes, the mom sets RLIMIT_AS to Y before starting the job (resmom/linux/mom_mach.c, mom_set_limits). I suspect the idea is that in principle a job with Np processes that uses Y bytes of address space could have one process that uses Y-eps bytes of memory while the remaining Np-1 processes share the remaining eps bytes. In that respect setting RLIMIT_AS to Y (instead of Y/Np) is reasonable. However, how much memory does a scheduler reserve for such a job, assuming that not all processors are assigned on the same node? I am guessing that the scheduler reserves just Y/Np address space and consequently nodes can potentially be oversubscribed. 2) While the job is running the mom(s) periodically sum up the address space usage of all processes that belong to the job on the node the mom is running on (resmom/linux/mom_mach.c, mom_over_limit, mem_sum). However, there is nothing in the code where the mom superior would sum up these results from each of the sister moms. At least I cannot find anything in the mom_over_limit and mem_sum routines that would do this. The consequence is that the control mechanism effectively only takes the address space used on the mom superior into account. I suspect that this is a bug/oversight. E.g., if you run a job with procs=2, vmem=3gb and each of the two processes ends up using 2gb of address space, then the job will get killed if the scheduler assigns two cores on the same node. However, the job will not get killed if the two processors get assigned on different nodes. Strangely enough the reporting mechanism for, e.g., qstat -f does query all moms. There is a spurious comment "only enforce cpu time and memory usage" in mom_main.c. This isn't really correct since vmem does get enforced in some strange way. I can't make sense of this ... There exists another problem with the vmem control mechanism: it does not take shared memory into account.
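A small model may make the procs=2, vmem=3gb example from point (2) concrete. It sketches the behaviour as described in this message, not the actual mom_over_limit or mem_sum code; the function names, the node names node01/node02 and the data layout are assumptions made for illustration.

```python
# Sketch of the behaviour described in this message; not Torque source code.
GB = 1024 ** 3


def local_sum(usage_by_node, node):
    """Address space summed over the job's processes on one node only."""
    return sum(usage_by_node.get(node, []))


def job_sum(usage_by_node):
    """Address space summed over the job's processes on every node."""
    return sum(sum(procs) for procs in usage_by_node.values())


def over_limit_local(usage_by_node, superior, limit):
    """Check that only looks at the mom superior's node."""
    return local_sum(usage_by_node, superior) > limit


def over_limit_job(usage_by_node, limit):
    """Check that adds up the usage reported by all nodes."""
    return job_sum(usage_by_node) > limit


vmem = 3 * GB
same_node = {"node01": [2 * GB, 2 * GB]}
two_nodes = {"node01": [2 * GB], "node02": [2 * GB]}

print(over_limit_local(same_node, "node01", vmem))  # both processes are counted
print(over_limit_local(two_nodes, "node01", vmem))  # only one process is counted
print(over_limit_job(two_nodes, vmem))              # job-wide sum counts both
```

With the node-local check the same job is over the limit or not depending purely on placement, whereas a job-wide sum gives the same answer in both cases.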
Let's assume that a job is submitted with nodes=1:ppn=2,vmem=3gb. Initially the job starts a single process that mallocs 2gb of memory. Then the job forks, and parent and child use the same 2gb of address space. Torque will add up the 2gb from parent and child and kill the job because the mem_sum routine does not check whether memory is shared between processes. I do not know how this could be done, but the current mechanism is incorrect nevertheless. What do people use when requesting memory for a shared memory job? As far as I can tell, neither the pvmem nor the vmem implementation makes sense, particularly as I do not understand how this can work with a scheduler that needs to reserve resources for a job. I would like to see the following: I. pvmem controls the amount of address space available to a job per assigned processor. I.e., the control process should sum up the address space of all the processes that were started by the mom initially. As far as I can tell this may not be too difficult to implement, since these processes should all have the same session id. II. vmem controls the total amount of address space for the job, i.e., the memory is added over all processes belonging to the job (not just on the mom superior). And shared memory should not be double counted. III. In the long run we may want to think about implementing different (p)vmem requests per requested processor ... Cheers, Martin From siegert at sfu.ca Mon Mar 5 21:20:08 2012 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 5 Mar 2012 20:20:08 -0800 Subject: [torqueusers] how to (re)start mom without killing jobs?
In-Reply-To: <4F55829D.2040705@unimelb.edu.au> References: <20120305043408.GA6494@stikine.sfu.ca> <4F55829D.2040705@unimelb.edu.au> Message-ID: <20120306042008.GB28825@stikine.sfu.ca> On Tue, Mar 06, 2012 at 02:21:01PM +1100, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 05/03/12 15:34, Martin Siegert wrote: > > > We used to be able to do this by using the -p argument for the > > mom, but apparently this is not working anymore: every time I start > > the mom using "pbs_mom -p" all running jobs get killed. > > That would be a really nasty regression, can you put in a bugzilla > entry for it please to flag it up to the developers. Done - bug #174 - Martin From guilherme at gf7.com.br Tue Mar 6 13:25:20 2012 From: guilherme at gf7.com.br (Guilherme Rocha) Date: Tue, 6 Mar 2012 17:25:20 -0300 Subject: [torqueusers] Success setting up a new Torque Environment in University In-Reply-To: <4E737DC0.2000307@ldeo.columbia.edu> References: <9980CD09-85BC-4967-876D-10F42B37FA20@sara.nl> <4E737DC0.2000307@ldeo.columbia.edu> Message-ID: Hello friends, I appreciate all of your answers a lot. I can also report that our Torque PBS cluster is running fine. All of your tips have been very important. We are using clustalw-mpi to align sequences with success. For information only: I'm sending you all a small picture in which we are "toasting" the processors. :) Also the script we are using to start jobs. Any new ideas will be very welcome. THANKS A LOT FOLKS. (Once more) SCRIPT:
#!/bin/sh
#PBS -l walltime=100:00:00
#PBS -l mem=22000mb
#PBS -l nodes=24:ppn=1
#PBS -j oe
lamboot -v /home/rbz/Documentos/scripts/nodes
mpirun -np 24 clustalw-mpi -INFILE=/home/rbz/Documentos/scripts/testecluster.fasta -OUTFILE=testeclustalw-mpi.out
best regards, -- -- Guilherme Rocha GF7 Doc & Systems - Soluções Tecnológicas Pesquisa e Desenvolvimento - World Wide R.
João Goulart, 170 - Rio Pardo - RS - CEP 96640-000 Mobile: +55 51 81400360 - Home Page: http://www.gf7.com.br -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120306/ed811b8b/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: 200.128.63.85 screen capture 2012-3-6-17-21-17.png Type: image/png Size: 95375 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120306/ed811b8b/attachment-0001.png From brandor5 at gmail.com Wed Mar 7 08:34:46 2012 From: brandor5 at gmail.com (Brandon Sawyers) Date: Wed, 7 Mar 2012 10:34:46 -0500 Subject: [torqueusers] policy not working as expected? Message-ID: Hello everyone: I posted this to mauiusers a couple of days ago but have seen no reply. We are bringing up a new system and are running into an issue with maui. We want jobs to behave like this: a user requests a number of nodes regardless of ppn and gets that number of nodes (nodes=6:ppn=1). At the same time, we want only one job to be running on a node at a time. So that user would get 6 nodes and no other jobs would be able to run on those nodes while that job is running, even though that user is only using 1 of the available cores. We expected the following two config changes to make this happen. JOBNODEMATCHPOLICY EXACTNODE NODEACCESSPOLICY SINGLEJOB Only one job runs on a node, as we want, but (using the example above) all 6 cores of that node are getting used, instead of 1 core on each of 6 different nodes. Interestingly, nodes=1:ppn=1 gives me 1 core from 1 node, while nodes=1:ppn=(2-6) gives me 6 cores. Something to be aware of: when I say node, I mean numa node. What are we missing? Thanks, Brandon -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120307/f53dbb18/attachment.html From rsvancara at wsu.edu Wed Mar 7 14:26:58 2012 From: rsvancara at wsu.edu (Svancara, Randall) Date: Wed, 7 Mar 2012 21:26:58 +0000 Subject: [torqueusers] Job Allocation on Nodes Message-ID: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> Perhaps this question has been answered before. I have users who want to distribute jobs equally amongst nodes. What I am observing at the moment is that when a user submits a job with nodes=12:ppn=3, the job uses three nodes with 12 cores per node. Is there a way to make the job use only three cores per node. How can I prevent this or setup some kind of affinity for following the user's job requirements? I have looked node_pack, but I am unsure if this does what I need. Currently node_pack is set to false. Thanks, Randall Svancara High Performance Computing Systems Administrator Washington State University 509-335-3039 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120307/f55c1925/attachment.html From glen.beane at gmail.com Wed Mar 7 14:47:36 2012 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Wed, 7 Mar 2012 16:47:36 -0500 Subject: [torqueusers] Job Allocation on Nodes In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> References: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> Message-ID: <962C352D-A732-4C72-BA74-0F68EA4492A6@gmail.com> What scheduler are you using? If you are using Moab or Maui you want the EXACTNODE node match policy instead of the default EXACTPROC Sent from my iPhone On Mar 7, 2012, at 4:26 PM, "Svancara, Randall" wrote: > Perhaps this question has been answered before. I have users who want to distribute jobs equally amongst nodes. 
What I am observing at the moment is that when a user submits a job with nodes=12:ppn=3, the job uses three nodes with 12 cores per node. Is there a way to make the job use only three cores per node. How can I prevent this or setup some kind of affinity for following the user's job requirements? > > I have looked node_pack, but I am unsure if this does what I need. Currently node_pack is set to false. > > Thanks, > > Randall Svancara > High Performance Computing Systems Administrator > Washington State University > 509-335-3039 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120307/af8be78c/attachment.html From rsvancara at wsu.edu Wed Mar 7 14:48:42 2012 From: rsvancara at wsu.edu (Svancara, Randall) Date: Wed, 7 Mar 2012 21:48:42 +0000 Subject: [torqueusers] Job Allocation on Nodes In-Reply-To: <962C352D-A732-4C72-BA74-0F68EA4492A6@gmail.com> References: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> <962C352D-A732-4C72-BA74-0F68EA4492A6@gmail.com> Message-ID: <1F880D7A2494B346B5AB96481EAE704A123E8D@EXMB-03.ad.wsu.edu> We are using Moab. So that sounds like a place to start. Randall Svancara High Performance Computing Systems Administrator Washington State University 509-335-3039 From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of glen.beane at gmail.com Sent: Wednesday, March 07, 2012 1:48 PM To: Torque Users Mailing List Cc: Torque Users Mailing List (torqueusers at supercluster.org) Subject: Re: [torqueusers] Job Allocation on Nodes What scheduler are you using?
If you are using Moab or Maui you want the EXACTNODE node match policy instead of the default EXACTPROC Sent from my iPhone On Mar 7, 2012, at 4:26 PM, "Svancara, Randall" > wrote: Perhaps this question has been answered before. I have users who want to distribute jobs equally amongst nodes. What I am observing at the moment is that when a user submits a job with nodes=12:ppn=3, the job uses three nodes with 12 cores per node. Is there a way to make the job use only three cores per node. How can I prevent this or setup some kind of affinity for following the user's job requirements? I have looked node_pack, but I am unsure if this does what I need. Currently node_pack is set to false. Thanks, Randall Svancara High Performance Computing Systems Administrator Washington State University 509-335-3039 _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120307/122c0a02/attachment-0001.html From Gareth.Williams at csiro.au Wed Mar 7 17:30:43 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Thu, 8 Mar 2012 11:30:43 +1100 Subject: [torqueusers] Job Allocation on Nodes In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> References: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102D593A194@exvic-mbx04.nexus.csiro.au> > Perhaps this question has been answered before. I have users who want to distribute jobs equally amongst nodes. What I am observing at the moment is that when a user submits a job with nodes=12:ppn=3, the job uses three nodes with 12 cores per node. Is there a way to make the job use only three cores per node.
How can I prevent this or setup some kind of affinity for following the user's job requirements? Hi Randall, Why would you want to do such a thing? If the user submits four of the jobs they will align, and you will get worse contention. I would suggest: if you need to spread jobs to access memory then you should schedule memory and/or if you need to avoid contention, say for memory bandwidth, then get the users to request whole nodes (all the available ppn) and only run as many processes as their scaling permits (they may need custom mpirun options). Gareth From rsvancara at wsu.edu Wed Mar 7 17:40:33 2012 From: rsvancara at wsu.edu (Svancara, Randall) Date: Thu, 8 Mar 2012 00:40:33 +0000 Subject: [torqueusers] Job Allocation on Nodes In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102D593A194@exvic-mbx04.nexus.csiro.au> References: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> <007DECE986B47F4EABF823C1FBB19C620102D593A194@exvic-mbx04.nexus.csiro.au> Message-ID: <1F880D7A2494B346B5AB96481EAE704A123FA5@EXMB-03.ad.wsu.edu> Hi, Basically for the reason you described: to prevent users from oversubscribing a node in terms of memory. I am still working to get a better handle on scheduling jobs. Perhaps I need to look at the -l mem flag? If I say I need five nodes, with 24GB of RAM per node, will -l mem=24GB give me five nodes with 1 core and 24GB of RAM? At this point I have been using nodes and ppn to regulate how much runs on each node, but I admit it is problematic, as there is no guarantee that someone else will not use the same node.
Thanks, Randall Svancara High Performance Computing Systems Administrator Washington State University 509-335-3039 -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gareth.Williams at csiro.au Sent: Wednesday, March 07, 2012 4:31 PM To: torqueusers at supercluster.org Subject: Re: [torqueusers] Job Allocation on Nodes > Perhaps this question has been answered before. I have users who want to distribute jobs equally amongst nodes. What I am observing at the moment is that when a user submits a job with nodes=12:ppn=3, the job uses three nodes with 12 cores per node. Is there a way to make the job use only three cores per node. How can I prevent this or setup some kind of affinity for following the user's job requirements? Hi Randall, Why would you want to do such a thing? If the user submits four of the jobs they will align, and you will get worse contention. I would suggest: if you need to spread jobs to access memory then you should schedule memory and/or if you need to avoid contention, say for memory bandwidth, then get the users to request whole nodes (all the available ppn) and only run as many processes as their scaling permits (they may need custom mpirun options).
Gareth _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From bill at Princeton.EDU Wed Mar 7 17:47:31 2012 From: bill at Princeton.EDU (Bill Wichser) Date: Wed, 07 Mar 2012 19:47:31 -0500 Subject: [torqueusers] Job Allocation on Nodes In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102D593A194@exvic-mbx04.nexus.csiro.au> References: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> <007DECE986B47F4EABF823C1FBB19C620102D593A194@exvic-mbx04.nexus.csiro.au> Message-ID: <4F5801A3.9030308@princeton.edu> On 3/7/2012 7:30 PM, Gareth.Williams at csiro.au wrote: >> Perhaps this question has been answered before. I have users who want to distribute jobs equally amongst nodes. What I am observing at the moment is that when a user submits a job with nodes=12:ppn=3, the job uses three nodes with 12 cores per node. Is there a way to make the job use only three cores per node. How can I prevent this or setup some kind of affinity for following the user's job requirements? > Hi Randall, > > Why would you want to do such a thing? If the user submits four of the jobs they will align, and you will get worse contention. I would suggest: if you need to spread jobs to access memory then you should schedule memory and/or if you need to avoid contention, say for memory bandwidth, then get the users to request whole nodes (all the available ppn) and only run as many processes as their scaling permits (they may need custom mpirun options). > > Gareth > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers I have users who desire to do just this -- maximize memory bandwidth for their application. It turns out that sharing the node with others always provides better memory bandwidth than running the full node with the job. 
This can be reproduced quantitatively while looking at walltime only. Sometimes allocating multiple cores to cover memory use is required but the --bynode flag for openmpi is always used. So memory contention is overcome and the node can be shared even with the same user's jobs as this contention tends to run in cycles instead of overlapping. Bill From jagga13 at gmail.com Thu Mar 8 09:53:46 2012 From: jagga13 at gmail.com (Jagga Soorma) Date: Thu, 8 Mar 2012 08:53:46 -0800 Subject: [torqueusers] Request unlimited walltime for jobs Message-ID: Hi Guys, I have a default of 1 hr setup for my walltime but in some cases the users would like to request a unlimited walltime for their jobs. Is there a way to do this using "-l walltime={something}" instead of me changing the global walltime value for my queue? Thanks, -Jagga -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120308/7d193bc5/attachment.html From sm4082 at nyu.edu Thu Mar 8 10:03:15 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Thu, 8 Mar 2012 12:03:15 -0500 Subject: [torqueusers] Request unlimited walltime for jobs In-Reply-To: References: Message-ID: <91AEA379-D5D8-4508-8852-889E2B280DDF@nyu.edu> Hi Jagga, First, ask them to submit jobs with the maximum walltime possible. Once they submit, use qalter to alter the walltime to whatever you want to give them. This way you don't have to change any queue setting. Best, Sreedhar. On Mar 8, 2012, at 11:53 AM, Jagga Soorma wrote: > Hi Guys, > > I have a default of 1 hr setup for my walltime but in some cases the users would like to request a unlimited walltime for their jobs. Is there a way to do this using "-l walltime={something}" instead of me changing the global walltime value for my queue? 
> > Thanks, > -Jagga > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jrosenquist at adaptivecomputing.com Thu Mar 8 10:32:53 2012 From: jrosenquist at adaptivecomputing.com (John Rosenquist) Date: Thu, 08 Mar 2012 10:32:53 -0700 Subject: [torqueusers] command line -q option Message-ID: <4F58ED45.1040509@adaptivecomputing.com> This is John Rosenquist, I'm one of the developers at Adaptive Computing working on torque. I was wondering if anyone uses the -q option on any of the commands (pbsnodes, etc). The purpose is to suppress all output from the command. I would like to get rid of it. Please let me know if anyone is using this feature. John. From tbaer at utk.edu Thu Mar 8 10:40:50 2012 From: tbaer at utk.edu (Troy Baer) Date: Thu, 8 Mar 2012 12:40:50 -0500 Subject: [torqueusers] [torquedev] command line -q option In-Reply-To: <4F58ED45.1040509@adaptivecomputing.com> References: <4F58ED45.1040509@adaptivecomputing.com> Message-ID: <1331228450.5702.479.camel@browncoat.jics.utk.edu> On Thu, 2012-03-08 at 10:32 -0700, John Rosenquist wrote: > This is John Rosenquist, I'm one of the developers at Adaptive Computing > working on torque. > > I was wondering if anyone uses the -q option on any of the commands > (pbsnodes, etc). The purpose is to suppress all output from the command. > > I would like to get rid of it. > > Please let me know if anyone is using this feature. I think you're going to have to be more specific about exactly which commands you mean. (For instance, if you remove the -q option from qsub, you may have a riot on your hands...) 
--Troy -- Troy Baer, Senior HPC System Administrator National Institute for Computational Sciences, University of Tennessee http://www.nics.tennessee.edu/ Phone: 865-241-4233 From gas5x at yahoo.com Thu Mar 8 10:41:57 2012 From: gas5x at yahoo.com (Grigory Shamov) Date: Thu, 8 Mar 2012 09:41:57 -0800 (PST) Subject: [torqueusers] command line -q option In-Reply-To: <4F58ED45.1040509@adaptivecomputing.com> Message-ID: <1331228517.83509.YahooMailClassic@web111304.mail.gq1.yahoo.com> Hi Jonh, As a user, I often used 'qstat -q' to query which queues are available on a given cluster. I guess it is different from -q option for pbsnodes etc., where it is not much used. -- Grigory Shamov University of Manitoba --- On Thu, 3/8/12, John Rosenquist wrote: > From: John Rosenquist > Subject: [torqueusers] command line -q option > To: "Torque Users Mailing List" , "Torque Developers mailing list" > Date: Thursday, March 8, 2012, 9:32 AM > This is John Rosenquist, I'm one of > the developers at Adaptive Computing > working on torque. > > I was wondering if anyone uses the -q option on any of the > commands > (pbsnodes, etc). The purpose is to suppress all output from > the command. > > I would like to get rid of it. > > Please let me know if anyone is using this feature. > > John. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From christina.salls at noaa.gov Thu Mar 8 12:31:00 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Thu, 8 Mar 2012 14:31:00 -0500 Subject: [torqueusers] Scheduler bound to ETHO IP port In-Reply-To: <654f80c7-2933-404e-9674-935553843683@mail> References: <654f80c7-2933-404e-9674-935553843683@mail> Message-ID: Thanks Rick! Sorry for the delayed response. I am just returning from vacation and catching up with email! This looks like what I need. 
I do not have a torque.cfg file in my /var/spool/torque directory, but I assume I can just create it. On Fri, Feb 17, 2012 at 5:54 PM, Rick McKay wrote: > Christina, > > I think you're looking for this: > > From 2.5.9 CHANGELOG file: > e - Added new option to torque.cfg name TRQ_IFNAME. This allows the user > to designate a preferred outbound interface for TORQUE requests. The > interface is the name of the NIC interface, for example eth0. > > Reference that parameter and QSUBHOST in Appendix K. > > --Rick > > Rick McKay | Technical Support Engineer > rmckay at adaptivecomputing.com > Direct: (801) 717-3395 | Toll free: 1-888-221-2008 x3395 > Adaptive Computing | www.adaptivecomputing.com > > ------------------------------ > *From: *"Christina Salls" > *To: *"Torque Users Mailing List" , > "Michael Saxon" , "Frank Indiviglio" < > frank.indiviglio at noaa.gov>, "Craig Tierney" , > "help >> GLERL IT Help" , "Jeff Hanson" < > jhanson at sgi.com>, "Brian Beagan" , "John Cardenas" < > cardenas at sgi.com> > *Sent: *Friday, February 17, 2012 2:07:47 PM > > *Subject: *[torqueusers] Scheduler bound to ETHO IP port > > Hi all, > > I have been experiencing a problem with jobs staying in my default > queue until I force execution with a qrun. It turns out that the reason is > that my torque server is configured on my second ethernet interface which > is connected to my compute nodes. The problem is that the scheduler is > bound to the 1st interface port. > > [root at wings server_logs]# ps -ef | grep pbs > root 1268 1 0 13:56 ? 00:00:00 /usr/local/sbin/pbs_server > -d /var/spool/torque -H admin.default.domain > root 14768 1 0 14:25 ? 
00:00:00 /usr/local/sbin/pbs_sched > -d /var/spool/torque > root 21956 16623 0 14:41 pts/25 00:00:00 grep pbs > [root at wings server_logs]# lsof -p 14768 > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > pbs_sched 14768 root cwd DIR 8,98 4096 6032970 > /var/spool/torque/sched_priv > pbs_sched 14768 root rtd DIR 8,98 4096 2 / > pbs_sched 14768 root txt REG 8,98 268782 3421344 > /usr/local/sbin/pbs_sched > pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ > ld-2.12.so > pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/ > libc-2.12.so > pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/ > libnss_files-2.12.so > pbs_sched 14768 root mem REG 8,98 791107 3418524 > /usr/local/lib/libtorque.so.2.0.0 > pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null > pbs_sched 14768 root 1w REG 8,98 0 6033331 > /var/spool/torque/sched_priv/sched_out > pbs_sched 14768 root 2w REG 8,98 0 6033331 > /var/spool/torque/sched_priv/sched_out > pbs_sched 14768 root 3w REG 8,98 2699 6033359 > /var/spool/torque/sched_logs/20120217 > pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP > wings.glerl.noaa.gov:15004 (LISTEN) > pbs_sched 14768 root 5wW REG 8,98 7 6033329 > /var/spool/torque/sched_priv/sched.lock > pbs_sched 14768 root 6r REG 8,98 4374 6032952 > /var/spool/torque/sched_priv/resource_group > pbs_sched 14768 root 7w REG 8,98 0 6033360 > /var/spool/torque/sched_priv/accounting/20120217 > [root at wings server_logs]# cd .. 
> [root at wings torque]# ls > aux checkpoint job_logs mom_logs mom_priv pbs_environment sched_logs > sched_priv server_logs server_name server_priv spool undelivered > [root at wings torque]# lsof -n -p 14768 > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > pbs_sched 14768 root cwd DIR 8,98 4096 6032970 > /var/spool/torque/sched_priv > pbs_sched 14768 root rtd DIR 8,98 4096 2 / > pbs_sched 14768 root txt REG 8,98 268782 3421344 > /usr/local/sbin/pbs_sched > pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ > ld-2.12.so > pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/ > libc-2.12.so > pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/ > libnss_files-2.12.so > pbs_sched 14768 root mem REG 8,98 791107 3418524 > /usr/local/lib/libtorque.so.2.0.0 > pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null > pbs_sched 14768 root 1w REG 8,98 0 6033331 > /var/spool/torque/sched_priv/sched_out > pbs_sched 14768 root 2w REG 8,98 0 6033331 > /var/spool/torque/sched_priv/sched_out > pbs_sched 14768 root 3w REG 8,98 2699 6033359 > /var/spool/torque/sched_logs/20120217 > pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP > 192.94.173.9:15004 (LISTEN) > pbs_sched 14768 root 5wW REG 8,98 7 6033329 > /var/spool/torque/sched_priv/sched.lock > pbs_sched 14768 root 6r REG 8,98 4374 6032952 > /var/spool/torque/sched_priv/resource_group > pbs_sched 14768 root 7w REG 8,98 0 6033360 > /var/spool/torque/sched_priv/accounting/20120217 > [root at wings torque]# ls > aux checkpoint job_logs mom_logs mom_priv pbs_environment sched_logs > sched_priv server_logs server_name server_priv spool undelivered > [root at wings torque]# cd sched_priv > [root at wings sched_priv]# ls > accounting dedicated_time holidays resource_group sched_config > sched.lock sched_out > [root at wings sched_priv]# more sched_config > > When I used hostname to change the name to the admin.default.domain, and > restarted the pbs_sched daemon, everything started working. 
> > Any idea how to change the hostname/IP/interface that the scheduler uses? > > Thanks, > > Christina > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > **** > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120308/0911268d/attachment.html From christina.salls at noaa.gov Thu Mar 8 12:31:49 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Thu, 8 Mar 2012 14:31:49 -0500 Subject: [torqueusers] Scheduler bound to ETHO IP port In-Reply-To: <242421BFAF465844BE24EB90BB97E2210197741C@ITSDAG1D.its.iastate.edu> References: <242421BFAF465844BE24EB90BB97E2210197741C@ITSDAG1D.its.iastate.edu> Message-ID: Thanks James! 
On Mon, Feb 20, 2012 at 11:43 AM, Coyle, James J [ITACD] wrote: > Christina, > > I think that it is common to use two interfaces on the login node, one > inward facing on a private subnet and > one outward facing, and place the internal interface name in > /var/spool/torque/server_name . > > Make sure that > > What I always do is to use /etc/hosts and insert a line like: > > 172.16.10.1 loginnode admin admin.default.domain > > and copy /etc/hosts to the compute nodes. > > You will also want to make sure that > files > precedes > dns > in /etc/nsswitch.conf > > Then I can use the internal name. > > - Jim C. > > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of Christina Salls > Sent: Friday, February 17, 2012 3:08 PM > To: Torque Users Mailing List; Michael Saxon; Frank Indiviglio; Craig > Tierney; help >> GLERL IT Help; Jeff Hanson; Brian Beagan; John Cardenas > Subject: [torqueusers] Scheduler bound to ETHO IP port > > Hi all, > > I have been experiencing a problem with jobs staying in my default > queue until I force execution with a qrun. It turns out that the reason is > that my torque server is configured on my second ethernet interface which > is connected to my compute nodes. The problem is that the scheduler is > bound to the 1st interface port. > > [root at wings server_logs]# ps -ef | grep pbs > root 1268 1 0 13:56 ? 00:00:00 /usr/local/sbin/pbs_server > -d /var/spool/torque -H admin.default.domain > root 14768 1 0 14:25 ?
00:00:00 /usr/local/sbin/pbs_sched > -d /var/spool/torque > root 21956 16623 0 14:41 pts/25 00:00:00 grep pbs > [root at wings server_logs]# lsof -p 14768 > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > pbs_sched 14768 root cwd DIR 8,98 4096 6032970 > /var/spool/torque/sched_priv > pbs_sched 14768 root rtd DIR 8,98 4096 2 / > pbs_sched 14768 root txt REG 8,98 268782 3421344 > /usr/local/sbin/pbs_sched > pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ > ld-2.12.so > pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/ > libc-2.12.so > pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/ > libnss_files-2.12.so > pbs_sched 14768 root mem REG 8,98 791107 3418524 > /usr/local/lib/libtorque.so.2.0.0 > pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null > pbs_sched 14768 root 1w REG 8,98 0 6033331 > /var/spool/torque/sched_priv/sched_out > pbs_sched 14768 root 2w REG 8,98 0 6033331 > /var/spool/torque/sched_priv/sched_out > pbs_sched 14768 root 3w REG 8,98 2699 6033359 > /var/spool/torque/sched_logs/20120217 > pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP > wings.glerl.noaa.gov:15004 (LISTEN) > pbs_sched 14768 root 5wW REG 8,98 7 6033329 > /var/spool/torque/sched_priv/sched.lock > pbs_sched 14768 root 6r REG 8,98 4374 6032952 > /var/spool/torque/sched_priv/resource_group > pbs_sched 14768 root 7w REG 8,98 0 6033360 > /var/spool/torque/sched_priv/accounting/20120217 > [root at wings server_logs]# cd .. > [root at wings torque]# ls > aux checkpoint job_logs mom_logs mom_priv pbs_environment sched_logs > sched_priv server_logs server_name server_priv spool undelivered > [root at wings torque]# lsof -n -p 14768 > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > pbs_sched 14768 root cwd DIR 8,98 4096 6032970 > /var/spool/torque/sched_priv > pbs_sched 14768 root rtd DIR 8,98 4096 2 / > pbs_sched
14768 root txt REG 8,98 268782 3421344 > /usr/local/sbin/pbs_sched > pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ > ld-2.12.so > pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/ > libc-2.12.so > pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/ > libnss_files-2.12.so > pbs_sched 14768 root mem REG 8,98 791107 3418524 > /usr/local/lib/libtorque.so.2.0.0 > pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null > pbs_sched 14768 root 1w REG 8,98 0 6033331 > /var/spool/torque/sched_priv/sched_out > pbs_sched 14768 root 2w REG 8,98 0 6033331 > /var/spool/torque/sched_priv/sched_out > pbs_sched 14768 root 3w REG 8,98 2699 6033359 > /var/spool/torque/sched_logs/20120217 > pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP > 192.94.173.9:15004 (LISTEN) > pbs_sched 14768 root 5wW REG 8,98 7 6033329 > /var/spool/torque/sched_priv/sched.lock > pbs_sched 14768 root 6r REG 8,98 4374 6032952 > /var/spool/torque/sched_priv/resource_group > pbs_sched 14768 root 7w REG 8,98 0 6033360 > /var/spool/torque/sched_priv/accounting/20120217 > [root at wings torque]# ls > aux checkpoint job_logs mom_logs mom_priv pbs_environment sched_logs > sched_priv server_logs server_name server_priv spool undelivered > [root at wings torque]# cd sched_priv > [root at wings sched_priv]# ls > accounting dedicated_time holidays resource_group sched_config > sched.lock sched_out > [root at wings sched_priv]# more sched_config > > When I used hostname to change the name to the admin.default.domain, and > restarted the pbs_sched daemon, everything started working. > > Any idea how to change the hostname/IP/interface that the scheduler uses? > > Thanks, > > Christina > > -- > Christina A.
Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120308/f75e7d07/attachment-0001.html From Gareth.Williams at csiro.au Thu Mar 8 13:33:31 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Fri, 9 Mar 2012 07:33:31 +1100 Subject: [torqueusers] Job Allocation on Nodes In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A123FA5@EXMB-03.ad.wsu.edu> References: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> <007DECE986B47F4EABF823C1FBB19C620102D593A194@exvic-mbx04.nexus.csiro.au> <1F880D7A2494B346B5AB96481EAE704A123FA5@EXMB-03.ad.wsu.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102D7AE2E61@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Svancara, Randall [mailto:rsvancara at wsu.edu] > Sent: Thursday, 8 March 2012 11:41 AM > To: Torque Users Mailing List > Subject: Re: [torqueusers] Job Allocation on Nodes > > Hi, > > Basically for the reason you described, prevent users from over > subscribing a node in term of memory. I am still working to get a > better handling on the scheduling jobs. Perhaps I need to look at > the -l mem flag? If I say I need five nodes, with 24GB of RAM per > node, will -l mem=24GB give me a five nodes with 1 core and 24GB of > RAM.
> At this point I have been using nodes and ppn to regulate how
> much runs on each node, but I admit, it is problematic as there is no
> guarantee that someone else will not use the same node.

Hi Randall,

I'd look at -l vmem rather than mem. vmem is whole-of-job so for exclusive access to 24GB nodes (because all the memory would be dedicated) you could have requests like -l nodes=12:ppn=3,vmem=288GB and -l nodes=5:ppn=1,vmem=120GB.

Gareth

>
> Thanks,
>
> Randall Svancara
> High Performance Computing Systems Administrator
> Washington State University
> 509-335-3039
>
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gareth.Williams at csiro.au
> Sent: Wednesday, March 07, 2012 4:31 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Job Allocation on Nodes
>
> > Perhaps this question has been answered before. I have users who
> > want to distribute jobs equally amongst nodes. What I am observing at
> > the moment is that when a user submits a job with nodes=12:ppn=3, the
> > job uses three nodes with 12 cores per node. Is there a way to make
> > the job use only three cores per node. How can I prevent this or setup
> > some kind of affinity for following the user's job requirements?
>
> Hi Randall,
>
> Why would you want to do such a thing? If the user submits four of the
> jobs they will align, and you will get worse contention. I would
> suggest: if you need to spread jobs to access memory then you should
> schedule memory and/or if you need to avoid contention, say for memory
> bandwidth, then get the users to request whole nodes (all the available
> ppn) and only run as many processes as their scaling permits (they may
> need custom mpirun options).
>
> Gareth
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From Gareth.Williams at csiro.au Thu Mar 8 13:59:31 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 9 Mar 2012 07:59:31 +1100
Subject: [torqueusers] Job Allocation on Nodes
In-Reply-To: <4F5801A3.9030308@princeton.edu>
References: <1F880D7A2494B346B5AB96481EAE704A123E29@EXMB-03.ad.wsu.edu> <007DECE986B47F4EABF823C1FBB19C620102D593A194@exvic-mbx04.nexus.csiro.au> <4F5801A3.9030308@princeton.edu>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102D7AE2E62@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Bill Wichser [mailto:bill at Princeton.EDU]
> Sent: Thursday, 8 March 2012 11:48 AM
> To: Torque Users Mailing List
> Cc: Williams, Gareth (CSIRO IM&T, Docklands)
> Subject: Re: [torqueusers] Job Allocation on Nodes
>
> On 3/7/2012 7:30 PM, Gareth.Williams at csiro.au wrote:
> >> Perhaps this question has been answered before. I have users who
> >> want to distribute jobs equally amongst nodes. What I am observing at
> >> the moment is that when a user submits a job with nodes=12:ppn=3, the
> >> job uses three nodes with 12 cores per node. Is there a way to make
> >> the job use only three cores per node. How can I prevent this or setup
> >> some kind of affinity for following the user's job requirements?
> > Hi Randall,
> >
> > Why would you want to do such a thing? If the user submits four of
> > the jobs they will align, and you will get worse contention. I would
> > suggest: if you need to spread jobs to access memory then you should
> > schedule memory and/or if you need to avoid contention, say for memory
> > bandwidth, then get the users to request whole nodes (all the available
> > ppn) and only run as many processes as their scaling permits (they may
> > need custom mpirun options).
> >
> > Gareth
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> I have users who desire to do just this -- maximize memory bandwidth for
> their application. It turns out that sharing the node with others
> always provides better memory bandwidth than running the full node with
> the job. This can be reproduced quantitatively while looking at
> walltime only. Sometimes allocating multiple cores to cover memory use
> is required but the --bynode flag for openmpi is always used.
>
> So memory contention is overcome and the node can be shared even with
> the same user's jobs as this contention tends to run in cycles instead
> of overlapping.
>
> Bill

Thanks Bill,

I think it is worthwhile having such postings on the list.

I agree as a generalization that we gain efficiency through overlapping demand - after all that is part of having a big cluster rather than many separate clusters. However, the key to HPC is allocating (dedicated) resources to jobs, but we have to choose a granularity of how dedicated is dedicated. Dedicating cpus/cores is sort of easy, but getting harder with multi-core. I'd suggest that dedicating memory bandwidth (and memory itself) is next most important. I'm resigned to having to share (bandwidth to) network and storage, though that can be moderated by layout of multi-process jobs.

If your site gets useful efficiency by spreading jobs to overlap memory bandwidth utilization then that is a good solution for you. We prefer a more conservative approach where there is less scope for jobs to impact on one another. This is a choice that the cluster manager or support team need to make and it helps to have this information available to inform such decisions.

Cheers,

Gareth

From jrosenquist at adaptivecomputing.com Thu Mar 8 14:19:11 2012
From: jrosenquist at adaptivecomputing.com (John Rosenquist)
Date: Thu, 8 Mar 2012 14:19:11 -0700
Subject: [torqueusers] [torquedev] command line -q option
In-Reply-To: <1331228450.5702.479.camel@browncoat.jics.utk.edu>
References: <4F58ED45.1040509@adaptivecomputing.com> <1331228450.5702.479.camel@browncoat.jics.utk.edu>
Message-ID:

Doh. My bad. Let me be more specific.

I'm only looking at removing the ones where -q means quiet.

pestat -q in contrib
pbsnodes -q (qnodes being an alias to pbsnodes)
tracejob

Thanks,
John.

On Thu, Mar 8, 2012 at 10:40 AM, Troy Baer wrote:

> On Thu, 2012-03-08 at 10:32 -0700, John Rosenquist wrote:
> > This is John Rosenquist, I'm one of the developers at Adaptive Computing
> > working on torque.
> >
> > I was wondering if anyone uses the -q option on any of the commands
> > (pbsnodes, etc). The purpose is to suppress all output from the command.
> >
> > I would like to get rid of it.
> >
> > Please let me know if anyone is using this feature.
>
> I think you're going to have to be more specific about exactly which
> commands you mean. (For instance, if you remove the -q option from
> qsub, you may have a riot on your hands...)
>
> --Troy
> --
> Troy Baer, Senior HPC System Administrator
> National Institute for Computational Sciences, University of Tennessee
> http://www.nics.tennessee.edu/
> Phone: 865-241-4233
>
>
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
>

--
--
John Rosenquist | Torque Developer
Direct Line: 801.341.4629 | Fax: 801.717.3738
1656 S. East Bay Blvd. Suite #300 | Provo, Utah 84601 | USA
Adaptive Computing, Ent.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120308/fbabdd07/attachment.html

From knielson at adaptivecomputing.com Thu Mar 8 14:45:21 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 8 Mar 2012 14:45:21 -0700
Subject: [torqueusers] 2.5.11 release candidate
Message-ID:

Hi all,

There is a release candidate for 2.5.11 now available for download. Please try this and let us know if you find any problems.

You can download the tar ball at
http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-2.5.11-snap.201203081434.tar.gz

Regards

Ken Nielson
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120308/0e7c9b5b/attachment.html

From Gareth.Williams at csiro.au Thu Mar 8 15:57:08 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 9 Mar 2012 09:57:08 +1100
Subject: [torqueusers] [torquedev] command line -q option
In-Reply-To:
References: <4F58ED45.1040509@adaptivecomputing.com> <1331228450.5702.479.camel@browncoat.jics.utk.edu>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102D7AE2E65@exvic-mbx04.nexus.csiro.au>

It seems to me (without looking at code) that with tracejob, -q is effectively the same as redirecting stderr to /dev/null. So we would not be losing anything that couldn't be easily done anyway.

Mostly I don't think I care.

Gareth

From: torquedev-bounces at supercluster.org [mailto:torquedev-bounces at supercluster.org] On Behalf Of John Rosenquist
Sent: Friday, 9 March 2012 8:19 AM
To: Torque Developers mailing list
Cc: Torque Users Mailing List
Subject: Re: [torquedev] command line -q option

Doh. My bad. Let me be more specific.

I'm only looking at removing the ones where -q means quiet.

pestat -q in contrib
pbsnodes -q (qnodes being an alias to pbsnodes)
tracejob

Thanks,
John.

On Thu, Mar 8, 2012 at 10:40 AM, Troy Baer wrote:
On Thu, 2012-03-08 at 10:32 -0700, John Rosenquist wrote:
> This is John Rosenquist, I'm one of the developers at Adaptive Computing
> working on torque.
>
> I was wondering if anyone uses the -q option on any of the commands
> (pbsnodes, etc). The purpose is to suppress all output from the command.
>
> I would like to get rid of it.
>
> Please let me know if anyone is using this feature.

I think you're going to have to be more specific about exactly which commands you mean. (For instance, if you remove the -q option from qsub, you may have a riot on your hands...)

--Troy
--
Troy Baer, Senior HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233

_______________________________________________
torquedev mailing list
torquedev at supercluster.org
http://www.supercluster.org/mailman/listinfo/torquedev

--
--
John Rosenquist | Torque Developer
Direct Line: 801.341.4629 | Fax: 801.717.3738
1656 S. East Bay Blvd. Suite #300 | Provo, Utah 84601 | USA
Adaptive Computing, Ent.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120309/66f09c3c/attachment-0001.html

From tfischer at dc.uba.ar Wed Mar 7 11:33:27 2012
From: tfischer at dc.uba.ar (Thomas Fischer)
Date: Wed, 7 Mar 2012 15:33:27 -0300
Subject: [torqueusers] request type QueueJob from host hulk rejected (host not authorized)
Message-ID:

Hi all,

I am building up a new cluster running debian lenny, and I decided to switch to torque. Until now I have just managed to do a first install of torque (version 2.4.8 from lenny-backports repo) and Maui (3.3.1 from source) on the server (called hulk), and torque-mom on one execution node (called nodo-32). I followed the guide on debianclusters.org to do so.
Everything seemed to be working, services are running, etc., but when I try to submit a test job (echo "sleep 30") with a user, the job is queued and deferred by maui. Here are what I consider relevant outputs:

--------------------------------------------------
tfischer at hulk:~$ echo "sleep 30" | qsub
13.hulk
--------------------------------------------------
tfischer at hulk:~$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
13.hulk                   STDIN            tfischer               0 Q main.queue
--------------------------------------------------
root at hulk:~# qrun -H nodo-32 13
qrun: Execution server rejected request MSG=cannot send job to mom, state=PRERUN 13.hulk
--------------------------------------------------
tfischer at hulk:~$ /usr/local/maui/bin/checkjob 13

checking job 13

State: Idle  EState: Deferred
Creds: user:tfischer group:tfischer class:main.queue qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Mar 7 15:08:05
  (Time Queued Total: 00:15:13 Eligible: 00:00:02)

StartDate: -00:15:10 Wed Mar 7 15:08:08
Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE

job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN')
Holds: Defer (hold reason: RMFailure)
PE: 1.00 StartPriority: 1
cannot select job 13 for partition DEFAULT (job hold active)
--------------------------------------------------
root at hulk:~# pbsnodes -a
nodo-32
     state = free
     np = 16
     ntype = cluster
     status = opsys=linux,uname=Linux nodo-32 2.6.26.x3550m3 #1 SMP Mon Jan 23 11:51:03 ART 2012 x86_64,sessions=5677,nsessions=1,nusers=1,idletime=164255,totmem=24817844kb,availmem=24725496kb,physmem=16431924kb,ncpus=16,loadave=0.00,netload=26551478,state=free,jobs=,varattr=,rectime=1331144485

nodo-33
     state = down
     np = 1
     ntype = cluster
--------------------------------------------------
from hulk:/var/spool/torque/server_logs/20120307

hulk PBS_Server: LOG_ERROR::Access from host not allowed, or unknown host (15008) in send_job, child failed in previous commit request for job 13.hulk
--------------------------------------------------
from nodo-32:/var/spool/torque/mom_logs/20120307

pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=QueueJob, from PBS_Server at hulk
--------------------------------------------------

It seems like the node is rejecting jobs from the server. The server name is defined on the node like this:

nodo-32:~# cat /var/spool/torque/server_name
hulk

Is there something I am forgetting about or misconfiguring?

Thanks in advance,
Thomas Fischer

--
restate my assumptions:
1. Mathematics is the language of nature.
2. Everything around us can be represented and understood through numbers.
3. If you graph these numbers, patterns emerge.
Therefore: There are patterns everywhere in nature.
Max Cohen, PI

From samuel at unimelb.edu.au Sun Mar 11 18:46:38 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 12 Mar 2012 11:46:38 +1100
Subject: [torqueusers] [torquedev] command line -q option
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102D7AE2E65@exvic-mbx04.nexus.csiro.au>
References: <4F58ED45.1040509@adaptivecomputing.com> <1331228450.5702.479.camel@browncoat.jics.utk.edu> <007DECE986B47F4EABF823C1FBB19C620102D7AE2E65@exvic-mbx04.nexus.csiro.au>
Message-ID: <4F5D476E.8040305@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 09/03/12 09:57, Gareth.Williams at csiro.au wrote:

> It seems to me (without looking at code) that with tracejob, -q is
> effectively the same as redirecting stderr to /dev/null. So we
> would not be losing anything that couldn't be easily done anyway.

Most of the "errors" tracejob emits are not really errors, they're just warning about files you wouldn't expect to be there in the first place. I'd rather it just didn't warn about them in the first place. :-)

- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk9dR24ACgkQO2KABBYQAh8jQwCdFi8m3W2ItdddgG5ntbVWdnVP
9H8AnRdDglePA3WAIWaPdrGvP39HmhTp
=Gjt7
-----END PGP SIGNATURE-----

From vner75 at gmail.com Mon Mar 12 04:37:58 2012
From: vner75 at gmail.com (Vahe nr)
Date: Mon, 12 Mar 2012 14:37:58 +0400
Subject: [torqueusers] Maui conf question
Message-ID:

Hi all

I would like to ask a question about putting a restriction on a queue so that it uses a restricted number of processors on each node.
In more detail: I have a cluster in which each node has 8 cores, and I have a queue named myqueue. How can I tell maui (or the server) to send the jobs of this queue to four processors only on each node? In other words, the job uses only four processors on each node.

Thanks in advance for any suggestion and help.

Regards,
Vahe
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120312/529d935b/attachment.html

From knielson at adaptivecomputing.com Mon Mar 12 10:38:33 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 12 Mar 2012 10:38:33 -0600
Subject: [torqueusers] Does any use Condor
Message-ID:

Hi all,

Does anyone out there currently use Condor, or has anyone used it in the past?

Ken
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120312/5c935276/attachment.html

From ramon.bastiaans at sara.nl Tue Mar 13 09:05:17 2012
From: ramon.bastiaans at sara.nl (Ramon Bastiaans)
Date: Tue, 13 Mar 2012 16:05:17 +0100
Subject: [torqueusers] [torquedev] 2.5.11 release candidate
In-Reply-To:
References:
Message-ID: <4F5F622D.50709@sara.nl>

Hi,

I have a little remark. I read in the Changelog that this bug:

* http://www.clusterresources.com/bugzilla/show_bug.cgi?id=168

should be fixed now in 2.5.11. This is great and thanks for that.

However, it would be nice (for me, the bug reporter) if you guys would actually log/close this in the Bugzilla ticket, so that I may know I should download 2.5.11 ;)

Kind regards,
- Ramon.

On 8-3-2012 22:45, Ken Nielson wrote:
> Hi all,
>
> There is a release candidate for 2.5.11 now available for download.
> Please try this and let us know if you find any problems.
>
> You can download the tar ball at
> http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-2.5.11-snap.201203081434.tar.gz
>
>
> Regards
>
> Ken Nielson
> Adaptive Computing

--
ing. R. Bastiaans, B.ICT * Senior Systems Programmer * Operations, Support and Development

SARA
Science Park 140        PO Box 94613
1098 XG Amsterdam NL    1090 GP Amsterdam NL
P.+31 (0)20 592 3000    F.+31 (0)20 668 3167
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120313/275c6520/attachment-0001.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4573 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120313/275c6520/attachment-0001.bin

From knielson at adaptivecomputing.com Tue Mar 13 10:43:31 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Tue, 13 Mar 2012 10:43:31 -0600
Subject: [torqueusers] TORQUE 2.5.11 available
Message-ID:

Hi all,

TORQUE 2.5.11 is now available. Please read the CHANGELOG and Release_Notes for changes and updates.

TORQUE 2.5.11 can be downloaded at
http://www.adaptivecomputing.com/resources/downloads/torque/torque-2.5.11.tar.gz

Thanks to everyone who made this new revision possible.

Regards

Ken Nielson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120313/5f41acf5/attachment.html

From sm4082 at nyu.edu Tue Mar 13 11:44:36 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Tue, 13 Mar 2012 13:44:36 -0400
Subject: [torqueusers] 2.5.10, does anyone have problems with pbs variables?
Message-ID:

Hi,

I have installed Torque 2.5.10 on our systems and since then we have had problems with PBS variables like PBS_NODEFILE, PBS_JOBID, PBS_JOBNAME, PBS_O_WORKDIR, etc. Surprisingly, on the same node it works ok for one user and it doesn't work for another user.

Right now, to solve this, I am sourcing another script into every pbs script through a qsub wrapper. But this line comes only after the pbs directives.
So, if anyone mentions these variables in #PBS -e and #PBS -o directives, then jobs are failing.

Is there anyone facing the same problem? I am not sure whether this issue would be taken care of if I installed 2.5.11 as I didn't see anything regarding this in the changelogs.

Here is one example. This user is always having the problems with PBS_NODEFILE variable (user runs parallel jobs with scripts in tcsh). For now, it's ok as I asked her to mention the absolute path for #PBS -e and -o lines and PBS_NODEFILE is defined by the script added to pbs script by qsub wrapper (the added script finds the PBS_NODEFILE in /opt/torque/aux/ and exports the variable PBS_NODEFILE with the filename).

Script emails me whenever it doesn't find this variable. This is an example:

on host compute-8-2.local for the parallel job.
"env | grep PBS" output

PBS_O_QUEUE=p12
PBS_O_HOST=login-0-1.local
PBS_O_HOME=/home/gs****
PBS_O_LANG=en_US.iso885915
PBS_O_LOGNAME=gs****
PBS_O_PATH=/home/gs****/bin:/bin:/share/apps/grace/5.1.22/intel/grace/bin:/share/apps/autodocksuite/4.2.1/intel/bin:/share/apps/grace/5.1.22/intel/grace/bin:/share/apps/apbs/1.1.0/intel/bin:/share/apps/gromacs/4.0.5/intel-mvapich/bin:/share/apps/amber11/intel-mvapich/amber11//bin:/share/apps/python/2.6.4/gnu/bin:/share/apps/gromacs/4.0.5/intel-mvapich/bin:/share/apps/amber11/intel-mvapich/amber11/exe:/usr/mpi/intel/mvapich-1.1.0/bin:/share/apps/matlab/R2009b/bin:/share/apps/mathematica/7.0/Executables/:/share/apps/vmd/1.8.7/:/share/apps/molden/4.7/gnu:/share/apps/mpiexec/0.84/gnu/bin:/share/apps/gaussian/G03-E01/intel/g03:/share/apps/intel/Compiler/11.1/046/bin/intel64:/usr/kerberos/bin:/usr/java/latest/bin:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/opt/torque/bin:/opt/torque/sbin:/home/gs****/jobscript/gromacs/Hpluplus2oplsaa/:/share/apps/gromacs/4.0.5/intel-mvapich/bin:/home/gs****/coarse-grain/ElNeDyn:/home/gs****/coarse-grain:/home/gs****/jobscript/:/home/gs****/jobscript/gromacs/coarse-grain/:/home/gs****/jobscript/gromacs/:/home/gs****/nedit/:/share/apps/mmtsb_toolset/intel/bin:/share/apps/mmtsb_toolset/intel/perl:/share/apps/dssp/intel/dsspcmbi:/share/apps/python/2.6.4/gnu/bin:/share/apps/time-scapes/1.2.2/intel/test:/share/apps/time-scapes/1.2.2/intel/bin:/share/apps/pymol/1.2r3pre/gnu/bin:/home/gs****/jobscript/gromacs/Hpluplus2amber99sb/:/home/gs****/jobscript/perl/:/share/apps/dssp/intel/:/home/gs****/jobscript/autodocktools/:/share/apps/mmtsb_toolset/perl/:/home/gs****/jobscript/docking:/share/apps/tinker/4.2/intel/bin/:/home/gs****/bin
PBS_O_MAIL=/var/spool/mail/gs****
PBS_O_SHELL=/bin/bash
PBS_SERVER=*****.its.nyu.edu
PBS_O_WORKDIR=/scratch/gs****/amber/ache.somSER/mdrunE199H/qm-mm-large3WAT/methylMIG-singleRC/qmmmMD/MD.0.24.80
PBS_JOBNAME=MD.0.24.80
PBS_JOBID=219825.crunch.local
PBS_QUEUE=p12
PBS_JOBCOOKIE=C4EA940FBC845949E6DB4D1BD7855EA0
PBS_NODENUM=0

I configured torque like this:

./configure --prefix=/opt/torque --libdir=/opt/torque/lib64 --with-default-server=crunch.its.nyu.edu --with-server-home=/opt/torque --enable-docs --enable-syslog --disable-gui --enable-blcr --disable-spool --enable-cpuset --enable-geometry-requests --enable-server-xml --with-pam=/lib64/security

If it's happening to anyone else, please respond to this email.

Thanks,
Sreedhar.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120313/7da83f90/attachment.html

From stevenx.a.duchene at intel.com Tue Mar 13 13:20:02 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 13 Mar 2012 19:20:02 +0000
Subject: [torqueusers] TORQUE 4.0 ???
In-Reply-To:
References:
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AB364C@ORSMSX106.amr.corp.intel.com>

Any news on Torque 4.0 yet?
I remember someone telling the list that something was supposed to be released on March 13th which is today...

Bueller? Bueller? Bueller?
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120313/3285a15c/attachment.html

From dbeer at adaptivecomputing.com Tue Mar 13 13:42:36 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 13 Mar 2012 13:42:36 -0600
Subject: [torqueusers] TORQUE 4.0 Officially Announced
Message-ID:

All,

TORQUE 4.0 is officially here! Please check out Adaptive Computing's official announcement here:
http://www.adaptivecomputing.com/adaptive-computing-offers-the-next-generation-of-high-performance-computing-with-moab-hpc-suite-7-0/

The tarball can be downloaded from here:
http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.0.tar.gz

We have several sites currently using 4.0 and feedback has been positive. These warnings are posted on the download site, but I am copying them here:

1. Make sure that you have openssl-devel (RedHat based) / libssl-dev (Debian based) installed (the name may differ for different operating systems) in order to be able to build TORQUE 4.0.
2. Make sure that you run the daemon trqauthd on machines that will be running client commands. NOTE: there is an init.d script for it in contrib/init.d/ but it needs customization (this includes Moab). One problem is that it has a misspelling for PBS_DAEMON - it should be /usr/local/sbin/trqauthd by default, not /usr/local/bin/trqauthd.
3. Moab needs to be started or restarted after installing TORQUE 4.0 (if you are using Moab)

Please make sure to take all normal precautions for upgrading.

Another advisory (not on the website) is that TORQUE now uses hwloc to manage cpusets, meaning you will need to install hwloc on your system if it isn't already there and you wish to use it. It needs to be version 1.1 or higher.
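
The advisories above can be turned into a quick pre-flight check. What follows is only a minimal sketch, not anything shipped with TORQUE: the function names (version_ge, check_hwloc, check_trqauthd) are hypothetical, and it assumes that hwloc is visible to pkg-config, that pgrep is installed, and that sort supports the GNU -V (version sort) option.

```shell
#!/bin/bash
# Sketch only: pre-flight checks for the TORQUE 4.0 advisories.
# All function names here are illustrative, not part of TORQUE.

# version_ge A B: succeeds when dotted version A is >= B.
# Assumes GNU sort with -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# hwloc must be 1.1 or newer if cpuset support is wanted.
# Assumes hwloc was installed with its pkg-config file.
check_hwloc() {
    local have
    have=$(pkg-config --modversion hwloc 2>/dev/null) || {
        echo "hwloc: not found via pkg-config"
        return 1
    }
    if version_ge "$have" "1.1"; then
        echo "hwloc: $have (ok)"
    else
        echo "hwloc: $have (too old, need >= 1.1)"
        return 1
    fi
}

# trqauthd must be running wherever client commands are used.
check_trqauthd() {
    if pgrep -x trqauthd >/dev/null 2>&1; then
        echo "trqauthd: running"
    else
        echo "trqauthd: not running"
        return 1
    fi
}

# Report both checks without aborting on the first failure.
check_hwloc || true
check_trqauthd || true
```

The version comparison is kept in its own function so it can be reused for other minimum-version requirements; the two checks only print a status line each and do not change anything on the system.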

The major features of the release are briefly described on the release, but the CHANGELOG for 4.0 is copied at the end of this email.

This release has undergone more testing than any previous release of TORQUE; to be fair, it also has more changes than any previous version of TORQUE. Overall, we saw very good results in our beta program and most of the sites using it have had good experiences. We are proud of the quality of this release and hope that you'll try it out and let us know how it works for you.

--
David Beer | Software Engineer
Adaptive Computing

4.0.0

e - make a threadpool for TORQUE server. The number of threads is customizable using min_threads and max_threads, and idle time before exiting can be set using thread_idle_seconds.
e - make pbs_server multi-threaded in order to increase responsiveness and scalability.
e - remove the forking from pbs_server running a job, the thread handling the request just waits until the job is run.
e - change qdel to simply send qdel all - previously this was executed by a qstat and a qdel of every individual job
e - no longer fork to send mail, just use a thread
e - use hwloc as the backbone for cpuset support in TORQUE (contributed by Dr. Bernd Kallies)
e - add the boolean variable $use_smt to mom config. If set to false, this skips logical cores and uses only physical cores for the job. It is true by default. (contributed by Dr. Bernd Kallies)
n - with the multi-threading the pbs_server -t create and -t cold commands could no longer ask for user input from the command line. The call to ask if the user wants to continue was moved higher in the initialization process and some of the wording changed to reflect what is now happening.
e - if cpusets are configured but aren't found and cannot be mounted, pbs_mom will now fail to start instead of failing silently.
e - Change node_spec from an N^2 (but average 5N) algorithm to an N algorithm with respect to nodes. We only loop over each node once at a maximum.
e - Abandon pbs_iff in favor of trqauthd. trqauthd is a daemon to be started once that can perform pbs_iff's functionality, increasing speed and enabling future security enhancements e - add mom_hierarchy functionality for reporting. The file is located in /server_priv/mom_hierarchy, and can be written to tell moms to send updates to other moms who will pass them on to pbs_server. See docs for details e - add a unit testing framework (check). It is compiled with --with-check and tests are executed using make check. The framework is complete but not many tests have been written as of yet. e - Mom rejection messages are now passed back to qrun when possible e - Added the option -c for startup. By default, the server attempts to send the mom hierarchy file to all moms on startup, and all moms update the server and request the hierarchy file. If both are trying to do this at once, it can cause a lot of traffic. -c tells pbs_server to wait 10 minutes to attempt to contact moms that haven't contacted it, reducing this traffic. e - Added mom parameter -w to reduce start times. This parameter wait to send it's first update until the server sends it the mom hierarchy file, or until 10 minutes have passed. This should reduce large cluster startup times. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120313/98915e9a/attachment-0001.html From adaptivecomputing at bridgemailsystem.com Wed Mar 14 07:12:30 2012 From: adaptivecomputing at bridgemailsystem.com (Adaptive Computing) Date: Wed, 14 Mar 2012 06:12:30 -0700 (PDT) Subject: [torqueusers] What's New from Adaptive - Moab 7.0 Message-ID: <1279377.1331730754566.JavaMail.root@mail4.bridgemailsystem.com> An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120314/156330fa/attachment.html

From ss18_2004 at yahoo.com Wed Mar 14 09:24:14 2012
From: ss18_2004 at yahoo.com (Calin Ilis)
Date: Wed, 14 Mar 2012 08:24:14 -0700 (PDT)
Subject: [torqueusers] Requeue job if node fails
Message-ID: <1331738654.80908.YahooMailNeo@web114713.mail.gq1.yahoo.com>

Hi,

Is it possible, using torque/maui, to requeue a job that was executing on a node which failed? My jobs are single-node jobs, so the failed node is the mother superior node.

Thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120314/9313c287/attachment.html

From jagga13 at gmail.com Wed Mar 14 12:54:40 2012
From: jagga13 at gmail.com (Jagga Soorma)
Date: Wed, 14 Mar 2012 11:54:40 -0700
Subject: [torqueusers] Kerberos tickets not being passwd via torque
Message-ID:

Hi Guys,

I have a problem in my cluster running torque 2.5.9: it is not passing the Kerberos tickets. I confirmed that after logging in I get a valid Kerberos ticket by running kinit. However, if I do a simple qsub -I to get an interactive session, I don't see that ticket anymore. But if I simply ssh to another node on my cluster, that ticket exists on that node. I don't see any options within configure for the torque source to enable kerberos/gssapi. Any help would be greatly appreciated.

Regards,
-J
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120314/172b8b7f/attachment.html

From adaptivecomputing at bridgemailsystem.com Thu Mar 15 12:15:42 2012
From: adaptivecomputing at bridgemailsystem.com (Adaptive Computing)
Date: Thu, 15 Mar 2012 11:15:42 -0700 (PDT)
Subject: [torqueusers] Live Cloud & HPC Webinars from Adaptive Computing
Message-ID: <28301405.1331835346133.JavaMail.root@mail4.bridgemailsystem.com>

An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120315/730af5a4/attachment-0001.html

From dsimas at stanford.edu Thu Mar 15 16:54:21 2012
From: dsimas at stanford.edu (David Gabriel Simas)
Date: Thu, 15 Mar 2012 15:54:21 -0700 (PDT)
Subject: [torqueusers] Torque 4.0.0 and cpusets
In-Reply-To: <2009330570.37067220.1331851544978.JavaMail.root@zm09.stanford.edu>
Message-ID: <229313363.37076179.1331852061413.JavaMail.root@zm09.stanford.edu>

Hello,

I'm playing around with Torque 4.0.0 on Fedora 16 (x86_64), trying to get cpusets working. The start-up of pbs_mom is failing with this error:

Starting TORQUE Mom: pbs_mom: LOG_ERROR::No such file or directory (2) in init_torque_cpuset, (create_cpuset) failed to open /dev/cpuset/cpus

Indeed, /dev/cpuset/cpus doesn't exist on my system. It seems to be named /dev/cpuset/cpuset.cpus instead. Likewise, /dev/cpuset/mems seems to be /dev/cpuset/cpuset.mems. That's the case with two kernels I've tried, 2.6.38.6-26 and 3.2.9-2. This naming also seems inconsistent with the documentation in cpuset(7).

Has anybody else seen this?

DGS

From giggzounet at gmail.com Fri Mar 16 08:40:28 2012
From: giggzounet at gmail.com (giggzounet)
Date: Fri, 16 Mar 2012 15:40:28 +0100
Subject: [torqueusers] Disabling or restrict in time the interactive queue
Message-ID:

Hi,

Our university has a cluster with torque (3.0.2)/maui. We would like to disable the interactive queue or to restrict it in time. For example:
- we would like to forbid "qsub -I".
OR
- we would like "qsub -I" to start an interactive job that lasts 30 minutes at most.

Is it possible?
Thx a lot,
Best regards,
Guillaume

From L.S.Lowe at bham.ac.uk Fri Mar 16 08:48:59 2012
From: L.S.Lowe at bham.ac.uk (Lawrence Lowe)
Date: Fri, 16 Mar 2012 14:48:59 +0000 (GMT)
Subject: [torqueusers] Disabling or restrict in time the interactive queue
In-Reply-To:
References:
Message-ID:

Hi, see disallowed_types in
http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml

LSL

On Fri, 16 Mar 2012, giggzounet wrote:

> Hi,
>
> Our university has a cluster with torque (3.0.2)/maui. We would like to
> disable the interactive queue or to restrict it in time. For example:
> - we would like to forbid "qsub -I".
> OR
> - we would like that "qsub -I" starts an interactive job for 30 minutes
> maximum.
>
> Is it possible ?
>
> Thx a lot,
> Best regards,
> Guillaume
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From sm4082 at nyu.edu Fri Mar 16 08:52:47 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 16 Mar 2012 10:52:47 -0400
Subject: [torqueusers] Disabling or restrict in time the interactive queue
In-Reply-To:
References:
Message-ID: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu>

qmgr -c 'set queue <queue name> disallowed_types = interactive'

Replace <queue name> with the queue you want to disable interactive jobs for.

Sreedhar.

On Mar 16, 2012, at 10:40 AM, giggzounet wrote:

> Hi,
>
> Our university has a cluster with torque (3.0.2)/maui. We would like to
> disable the interactive queue or to restrict it in time. For example:
> - we would like to forbid "qsub -I".
> OR
> - we would like that "qsub -I" starts an interactive job for 30 minutes
> maximum.
>
> Is it possible ?
>
> Thx a lot,
> Best regards,
> Guillaume
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From giggzounet at gmail.com Fri Mar 16 10:33:38 2012
From: giggzounet at gmail.com (giggzounet)
Date: Fri, 16 Mar 2012 17:33:38 +0100
Subject: [torqueusers] Disabling or restrict in time the interactive queue
In-Reply-To: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu>
References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu>
Message-ID:

Thx a lot!

I had tested "disallowed_types", but on a "Route" queue, and it seems to work only with an "Execution" queue.

Best regards,
Guillaume

On 16/03/2012 15:52, Sreedhar Manchu wrote:
> qmgr -c 'set queue <queue name> disallowed_types = interactive'
>
> replace queue name with the queue you want to disable interactive jobs for.
>
> Sreedhar.
>
>
> On Mar 16, 2012, at 10:40 AM, giggzounet wrote:
>
>> Hi,
>>
>> Our university has a cluster with torque (3.0.2)/maui. We would like to
>> disable the interactive queue or to restrict it in time. For example:
>> - we would like to forbid "qsub -I".
>> OR
>> - we would like that "qsub -I" starts an interactive job for 30 minutes
>> maximum.
>>
>> Is it possible ?
>>
>> Thx a lot,
>> Best regards,
>> Guillaume
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers

From sm4082 at nyu.edu Fri Mar 16 10:53:42 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 16 Mar 2012 12:53:42 -0400
Subject: [torqueusers] Disabling or restrict in time the interactive queue
In-Reply-To:
References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu>
Message-ID: <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu>

Write a submit filter or qsub wrapper that checks for the interactive flag -I and rejects the job. It should be fairly simple.
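
A minimal sketch of such a check, written as shell functions that could sit behind a qsub alias on the login nodes. The function names are hypothetical, REAL_QSUB is an assumed variable that defaults to /opt/torque/bin/qsub, and only a standalone -I argument is detected (a combined form such as -IV is not caught):

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) if any argument is exactly "-I".
has_interactive_flag() {
    for arg in "$@"; do
        if [ "$arg" = "-I" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical wrapper: refuses interactive submissions and passes
# everything else, unchanged, to the real qsub binary.
# REAL_QSUB is an assumption; adjust it to the local install path.
qsub_no_interactive() {
    if has_interactive_flag "$@"; then
        echo "qsub: interactive jobs (-I) are not allowed here" >&2
        return 1
    fi
    "${REAL_QSUB:-/opt/torque/bin/qsub}" "$@"
}
```

If these functions were sourced from a login-node profile, something like alias qsub=qsub_no_interactive would route interactive submissions through the check; scripts that call the qsub binary by its full path would bypass it.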
Or you can write a wrapper and place it on the login nodes, where it checks the qsub args. Make it an alias for qsub. This wrapper acts before the submit filter does. It also checks the args and exits whenever it finds the -I arg.

Like you said, disallowed_types doesn't work for a routing queue.

Sreedhar.

On Mar 16, 2012, at 12:33 PM, giggzounet wrote:

> Thx a lot!
>
> I had tested this "disallowed_types", but on a "Route" queue...And it
> seems to work only with "Execution" queue.
>
> Best regards,
> Guillaume
>
> On 16/03/2012 15:52, Sreedhar Manchu wrote:
>> qmgr -c 'set queue <queue name> disallowed_types = interactive'
>>
>> replace queue name with the queue you want to disable interactive jobs for.
>>
>> Sreedhar.
>>
>>
>> On Mar 16, 2012, at 10:40 AM, giggzounet wrote:
>>
>>> Hi,
>>>
>>> Our university has a cluster with torque (3.0.2)/maui. We would like to
>>> disable the interactive queue or to restrict it in time. For example:
>>> - we would like to forbid "qsub -I".
>>> OR
>>> - we would like that "qsub -I" starts an interactive job for 30 minutes
>>> maximum.
>>>
>>> Is it possible ?
>>>
>>> Thx a lot,
>>> Best regards,
>>> Guillaume
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From stevenx.a.duchene at intel.com Fri Mar 16 12:22:05 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Fri, 16 Mar 2012 18:22:05 +0000
Subject: [torqueusers] Torque-4.0 bug?
conflicting types for get_svrport
In-Reply-To: <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu>
References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com>

I am attempting to compile Maui-3.3.1 on a system where I recently installed Torque-4.0 rpms I built from the spec file included with the Torque 4.0 sources. I am getting the following error when the system attempts to build MPBSI.c

gcc -I../../include/ -I/usr/local/maui/include -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 -c MPBSI.c
MPBSI.c:177: error: conflicting types for ‘get_svrport’
/usr/include/torque/pbs_ifl.h:681: note: previous declaration of ‘get_svrport’ was here
MPBSI.c:178: error: conflicting types for ‘openrm’
/usr/include/torque/pbs_ifl.h:682: note: previous declaration of ‘openrm’ was here
make[1]: *** [MPBSI.o] Error 1
make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab'
make: *** [all] Error 2

--
Steven DuChene

From jfarran at uci.edu Fri Mar 16 13:10:26 2012
From: jfarran at uci.edu (Joseph Farran)
Date: Fri, 16 Mar 2012 12:10:26 -0700
Subject: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com>
References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com>
Message-ID: <4F639022.4060706@uci.edu>

I too just finished testing / wasting several hours trying to go from Torque 2.4.9 to version 4 and version 4 is NOT ready for prime time.

There is also a nasty bug where it will corrupt the Torque db. I did not have time to write down the errors, but I was able to repeat it.

Don't try to upgrade on a production system!
Joseph DuChene, StevenX A wrote: > I am attempting to compile Maui-3.3.1 on a system where I recently installed Torque-4.0 rpms I built from the spec file included with the Torque 4.0 sources. I am getting the following error out when the system attempts to build MPBSI.c > > gcc -I../../include/ -I/usr/local/maui/include -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 -c MPBSI.c > MPBSI.c:177: error: conflicting types for ?get_svrport? > /usr/include/torque/pbs_ifl.h:681: note: previous declaration of ?get_svrport? was here > MPBSI.c:178: error: conflicting types for ?openrm? > /usr/include/torque/pbs_ifl.h:682: note: previous declaration of ?openrm? was here > make[1]: *** [MPBSI.o] Error 1 > make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab' > make: *** [all] Error 2 > > -- > Steven DuChene > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > From stevenx.a.duchene at intel.com Fri Mar 16 13:40:46 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Fri, 16 Mar 2012 19:40:46 +0000 Subject: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport In-Reply-To: <4F639022.4060706@uci.edu> References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> <4F639022.4060706@uci.edu> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABC79D@ORSMSX106.amr.corp.intel.com> Can you explain the steps you used to reproduce the corrupt DB problem? -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Joseph Farran Sent: Friday, March 16, 2012 12:10 PM To: Torque Users Mailing List Subject: Re: [torqueusers] Torque-4.0 bug? 
conflicting types for get_svrport I too just finished testing / wasting several hours trying to go from Torque 2.4.9 to version 4 and version 4 is NOT ready for prime time. There is also a nasty bug where it will corrupt the Torque db. Did not have time to write the erros, but I was able to repeat it. Don't try to upgrade on a production system! Joseph DuChene, StevenX A wrote: > I am attempting to compile Maui-3.3.1 on a system where I recently installed Torque-4.0 rpms I built from the spec file included with the Torque 4.0 sources. I am getting the following error out when the system attempts to build MPBSI.c > > gcc -I../../include/ -I/usr/local/maui/include -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 -c MPBSI.c > MPBSI.c:177: error: conflicting types for ?get_svrport? > /usr/include/torque/pbs_ifl.h:681: note: previous declaration of ?get_svrport? was here > MPBSI.c:178: error: conflicting types for ?openrm? > /usr/include/torque/pbs_ifl.h:682: note: previous declaration of ?openrm? was here > make[1]: *** [MPBSI.o] Error 1 > make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab' > make: *** [all] Error 2 > > -- > Steven DuChene > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From knielson at adaptivecomputing.com Fri Mar 16 13:45:54 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 16 Mar 2012 13:45:54 -0600 Subject: [torqueusers] Torque-4.0 bug? 
conflicting types for get_svrport In-Reply-To: <4F639022.4060706@uci.edu> References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> <4F639022.4060706@uci.edu> Message-ID: On Fri, Mar 16, 2012 at 1:10 PM, Joseph Farran wrote: > I too just finished testing / wasting several hours trying to go from > Torque 2.4.9 to version 4 and version 4 is NOT ready for prime time. > > There is also a nasty bug where it will corrupt the Torque db. Did > not have time to write the erros, but I was able to repeat it. > Was the serverdb corrupted in 4.0 or when you went back to 2.4.9.? The Release_Notes file has a warning that the format for serverdb has changed to XML from the opaque format of 2.4.x. Ken > > Don't try to upgrade on a production system! > > Joseph > > DuChene, StevenX A wrote: > > I am attempting to compile Maui-3.3.1 on a system where I recently > installed Torque-4.0 rpms I built from the spec file included with the > Torque 4.0 sources. I am getting the following error out when the system > attempts to build MPBSI.c > > > > gcc -I../../include/ -I/usr/local/maui/include -I/usr/include/torque > -D__LINUX -D__MPBS -g -O2 -D__M64 -c MPBSI.c > > MPBSI.c:177: error: conflicting types for ?get_svrport? > > /usr/include/torque/pbs_ifl.h:681: note: previous declaration of > ?get_svrport? was here > > MPBSI.c:178: error: conflicting types for ?openrm? > > /usr/include/torque/pbs_ifl.h:682: note: previous declaration of > ?openrm? 
was here
> > make[1]: *** [MPBSI.o] Error 1
> > make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab'
> > make: *** [all] Error 2
> >
> > --
> > Steven DuChene
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120316/d719c3e2/attachment.html

From jfarran at uci.edu Fri Mar 16 13:51:23 2012
From: jfarran at uci.edu (Joseph Farran)
Date: Fri, 16 Mar 2012 12:51:23 -0700
Subject: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABC79D@ORSMSX106.amr.corp.intel.com>
References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> <4F639022.4060706@uci.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC79D@ORSMSX106.amr.corp.intel.com>
Message-ID: <4F6399BB.3030005@uci.edu>

After compiling Torque 4.0.0, I started fresh with:

/opt/torque/sbin/pbs_server -t create

I then re-read my qmgr configs from my old Torque. I was also able to stop and start Torque v4 just fine.

I then added a node, or made changes to one (I forget which), and Torque died. When I tried to start it again, it said:

# service pbs_server start
Starting TORQUE Server: PBS_Server: LOG_ERROR::svr_recov_xml, No server tag found in the database file???
PBS_Server: LOG_ERROR::recov_svr_attr, Unable to read server database
/opt/torque/sbin/pbs_server: failed to get server attributes

Fun day.

DuChene, StevenX A wrote:
> Can you explain the steps you used to reproduce the corrupt DB problem?
> -- > Steven DuChene > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Joseph Farran > Sent: Friday, March 16, 2012 12:10 PM > To: Torque Users Mailing List > Subject: Re: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport > > I too just finished testing / wasting several hours trying to go from > Torque 2.4.9 to version 4 and version 4 is NOT ready for prime time. > > There is also a nasty bug where it will corrupt the Torque db. Did > not have time to write the erros, but I was able to repeat it. > > Don't try to upgrade on a production system! > > Joseph > > DuChene, StevenX A wrote: > >> I am attempting to compile Maui-3.3.1 on a system where I recently installed Torque-4.0 rpms I built from the spec file included with the Torque 4.0 sources. I am getting the following error out when the system attempts to build MPBSI.c >> >> gcc -I../../include/ -I/usr/local/maui/include -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 -c MPBSI.c >> MPBSI.c:177: error: conflicting types for ?get_svrport? >> /usr/include/torque/pbs_ifl.h:681: note: previous declaration of ?get_svrport? was here >> MPBSI.c:178: error: conflicting types for ?openrm? >> /usr/include/torque/pbs_ifl.h:682: note: previous declaration of ?openrm? 
was here >> make[1]: *** [MPBSI.o] Error 1 >> make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab' >> make: *** [all] Error 2 >> >> -- >> Steven DuChene >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > From dbeer at adaptivecomputing.com Fri Mar 16 14:01:56 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 16 Mar 2012 14:01:56 -0600 Subject: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> Message-ID: Steven, I didn't know that this bug went all the way back to Maui. It turns out that Moab (and Maui I suppose) both called functions not in the pbs_ifl.h file, and had incorrect declarations for these functions. Since you're doing this in Maui, you may need to update the source code to use the same declaration as is in the pbs_ifl.h file. I apologize for this inconvenience, we should've put this in the release notes but we just didn't know that this bug went all the way back to Maui. David On Fri, Mar 16, 2012 at 12:22 PM, DuChene, StevenX A < stevenx.a.duchene at intel.com> wrote: > I am attempting to compile Maui-3.3.1 on a system where I recently > installed Torque-4.0 rpms I built from the spec file included with the > Torque 4.0 sources. 
I am getting the following error out when the system > attempts to build MPBSI.c > > gcc -I../../include/ -I/usr/local/maui/include -I/usr/include/torque > -D__LINUX -D__MPBS -g -O2 -D__M64 -c MPBSI.c > MPBSI.c:177: error: conflicting types for ?get_svrport? > /usr/include/torque/pbs_ifl.h:681: note: previous declaration of > ?get_svrport? was here > MPBSI.c:178: error: conflicting types for ?openrm? > /usr/include/torque/pbs_ifl.h:682: note: previous declaration of ?openrm? > was here > make[1]: *** [MPBSI.o] Error 1 > make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab' > make: *** [all] Error 2 > > -- > Steven DuChene > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120316/8a023e5b/attachment.html From dbeer at adaptivecomputing.com Fri Mar 16 14:02:58 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 16 Mar 2012 14:02:58 -0600 Subject: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport In-Reply-To: <4F6399BB.3030005@uci.edu> References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> <4F639022.4060706@uci.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC79D@ORSMSX106.amr.corp.intel.com> <4F6399BB.3030005@uci.edu> Message-ID: On Fri, Mar 16, 2012 at 1:51 PM, Joseph Farran wrote: > After compiling Torque 4.0.0, I started fresh with: > > /opt/torque/sbin/pbs_server -t create > > I then re-read my qmgr comfigs from my old Torque. I was also able to > stop and start Torque v4 just fine. > > I then added a node, or made changes to it ( I forget ) and Torque > died. 
When I tried to start it again, it said: > > Can you clarify exactly what happened? Did you crash TORQUE? Were you able to get a core file? > # service pbs_server start > Starting TORQUE Server: PBS_Server: LOG_ERROR::svr_recov_xml, No > server tag found in the database file??? > PBS_Server: LOG_ERROR::recov_svr_attr, Unable to read server database > /opt/torque/sbin/pbs_server: failed to get server attributes > > Fun day. > > DuChene, StevenX A wrote: > > Can you explain the steps you used to reproduce the corrupt DB problem? > > -- > > Steven DuChene > > > > -----Original Message----- > > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of Joseph Farran > > Sent: Friday, March 16, 2012 12:10 PM > > To: Torque Users Mailing List > > Subject: Re: [torqueusers] Torque-4.0 bug? conflicting types for > get_svrport > > > > I too just finished testing / wasting several hours trying to go from > > Torque 2.4.9 to version 4 and version 4 is NOT ready for prime time. > > > > There is also a nasty bug where it will corrupt the Torque db. Did > > not have time to write the erros, but I was able to repeat it. > > > > Don't try to upgrade on a production system! > > > > Joseph > > > > DuChene, StevenX A wrote: > > > >> I am attempting to compile Maui-3.3.1 on a system where I recently > installed Torque-4.0 rpms I built from the spec file included with the > Torque 4.0 sources. I am getting the following error out when the system > attempts to build MPBSI.c > >> > >> gcc -I../../include/ -I/usr/local/maui/include > -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 -c MPBSI.c > >> MPBSI.c:177: error: conflicting types for ?get_svrport? > >> /usr/include/torque/pbs_ifl.h:681: note: previous declaration of > ?get_svrport? was here > >> MPBSI.c:178: error: conflicting types for ?openrm? > >> /usr/include/torque/pbs_ifl.h:682: note: previous declaration of > ?openrm? 
was here
> >> make[1]: *** [MPBSI.o] Error 1
> >> make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab'
> >> make: *** [all] Error 2
> >>
> >> --
> >> Steven DuChene
> >>
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> >>
> >>
> >>
> >>
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120316/47cdf19d/attachment-0001.html

From jfarran at uci.edu Fri Mar 16 14:14:43 2012
From: jfarran at uci.edu (Joseph Farran)
Date: Fri, 16 Mar 2012 13:14:43 -0700
Subject: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport
In-Reply-To:
References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> <4F639022.4060706@uci.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC79D@ORSMSX106.amr.corp.intel.com> <4F6399BB.3030005@uci.edu>
Message-ID: <4F639F33.9090102@uci.edu>

Sorry, no.
I made the age-old mistake of trying to do this on a production system instead of on a test bed, so I was hurrying things and did not get details.

I am using Maui 3.3.1 and I also recompiled Maui after Torque v4.
What I do know is that changing a node's properties with qmgr caused /opt/torque/sbin/pbs_server to die, and when I tried to start it again, I could not because of the corrupted db.
David Beer wrote: > > > On Fri, Mar 16, 2012 at 1:51 PM, Joseph Farran > wrote: > > After compiling Torque 4.0.0, I started fresh with: > > /opt/torque/sbin/pbs_server -t create > > I then re-read my qmgr comfigs from my old Torque. I was also > able to > stop and start Torque v4 just fine. > > I then added a node, or made changes to it ( I forget ) and Torque > died. When I tried to start it again, it said: > > > Can you clarify exactly what happened? Did you crash TORQUE? Were you > able to get a core file? > > > > # service pbs_server start > Starting TORQUE Server: PBS_Server: LOG_ERROR::svr_recov_xml, No > server tag found in the database file??? > PBS_Server: LOG_ERROR::recov_svr_attr, Unable to read server > database > /opt/torque/sbin/pbs_server: failed to get server attributes > > Fun day. > > DuChene, StevenX A wrote: > > Can you explain the steps you used to reproduce the corrupt DB > problem? > > -- > > Steven DuChene > > > > -----Original Message----- > > From: torqueusers-bounces at supercluster.org > > [mailto:torqueusers-bounces at supercluster.org > ] On Behalf Of Joseph > Farran > > Sent: Friday, March 16, 2012 12:10 PM > > To: Torque Users Mailing List > > Subject: Re: [torqueusers] Torque-4.0 bug? conflicting types for > get_svrport > > > > I too just finished testing / wasting several hours trying to go > from > > Torque 2.4.9 to version 4 and version 4 is NOT ready for prime time. > > > > There is also a nasty bug where it will corrupt the Torque db. > Did > > not have time to write the erros, but I was able to repeat it. > > > > Don't try to upgrade on a production system! > > > > Joseph > > > > DuChene, StevenX A wrote: > > > >> I am attempting to compile Maui-3.3.1 on a system where I > recently installed Torque-4.0 rpms I built from the spec file > included with the Torque 4.0 sources. 
I am getting the following > error out when the system attempts to build MPBSI.c > >> > >> gcc -I../../include/ -I/usr/local/maui/include > -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 > -c MPBSI.c > >> MPBSI.c:177: error: conflicting types for ?get_svrport? > >> /usr/include/torque/pbs_ifl.h:681: note: previous declaration > of ?get_svrport? was here > >> MPBSI.c:178: error: conflicting types for ?openrm? > >> /usr/include/torque/pbs_ifl.h:682: note: previous declaration > of ?openrm? was here > >> make[1]: *** [MPBSI.o] Error 1 > >> make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab' > >> make: *** [all] Error 2 > >> > >> -- > >> Steven DuChene > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > >> > >> > >> > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > From dbeer at adaptivecomputing.com Fri Mar 16 14:37:27 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 16 Mar 2012 14:37:27 -0600 Subject: [torqueusers] Torque-4.0 bug? 
conflicting types for get_svrport In-Reply-To: <4F639F33.9090102@uci.edu> References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> <4F639022.4060706@uci.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC79D@ORSMSX106.amr.corp.intel.com> <4F6399BB.3030005@uci.edu> <4F639F33.9090102@uci.edu> Message-ID: Joseph, Sorry, I'm unable to reproduce your problem with the information that you provided. I was able to change properties on an existing node, add a new node, and then change the properties on that node (as well as np) all without crashing. David On Fri, Mar 16, 2012 at 2:14 PM, Joseph Farran wrote: > Sorry no. > I made age old mistake of trying to do this on a production system instead > of on a test bed, so I was hurrying things and did not get details. > > I am using Maui 3.3.1 and I also recompiled Maui after Torque v4. > What I do know is that changing a node properties with qmgr cause > /opt/torque/sbin/pbs_server to die and when I tried to start it again, I > could not with the corrupted db. > > > David Beer wrote: > >> >> >> On Fri, Mar 16, 2012 at 1:51 PM, Joseph Farran > jfarran at uci.edu>> wrote: >> >> After compiling Torque 4.0.0, I started fresh with: >> >> /opt/torque/sbin/pbs_server -t create >> >> I then re-read my qmgr comfigs from my old Torque. I was also >> able to >> stop and start Torque v4 just fine. >> >> I then added a node, or made changes to it ( I forget ) and Torque >> died. When I tried to start it again, it said: >> >> >> Can you clarify exactly what happened? Did you crash TORQUE? Were you >> able to get a core file? >> >> >> # service pbs_server start >> Starting TORQUE Server: PBS_Server: LOG_ERROR::svr_recov_xml, No >> server tag found in the database file??? >> PBS_Server: LOG_ERROR::recov_svr_attr, Unable to read server >> database >> /opt/torque/sbin/pbs_server: failed to get server attributes >> >> Fun day. 
>> >> DuChene, StevenX A wrote: >> > Can you explain the steps you used to reproduce the corrupt DB >> problem? >> > -- >> > Steven DuChene >> > >> > -----Original Message----- >> > From: torqueusers-bounces@**supercluster.org >> >> > >> [mailto:torqueusers-bounces@**supercluster.org >> >] >> On Behalf Of Joseph >> Farran >> > Sent: Friday, March 16, 2012 12:10 PM >> > To: Torque Users Mailing List >> > Subject: Re: [torqueusers] Torque-4.0 bug? conflicting types for >> get_svrport >> > >> > I too just finished testing / wasting several hours trying to go >> from >> > Torque 2.4.9 to version 4 and version 4 is NOT ready for prime time. >> > >> > There is also a nasty bug where it will corrupt the Torque db. >> Did >> > not have time to write the erros, but I was able to repeat it. >> > >> > Don't try to upgrade on a production system! >> > >> > Joseph >> > >> > DuChene, StevenX A wrote: >> > >> >> I am attempting to compile Maui-3.3.1 on a system where I >> recently installed Torque-4.0 rpms I built from the spec file >> included with the Torque 4.0 sources. I am getting the following >> error out when the system attempts to build MPBSI.c >> >> >> >> gcc -I../../include/ -I/usr/local/maui/include >> -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 >> -c MPBSI.c >> >> MPBSI.c:177: error: conflicting types for ?get_svrport? >> >> /usr/include/torque/pbs_ifl.h:**681: note: previous declaration >> of ?get_svrport? was here >> >> MPBSI.c:178: error: conflicting types for ?openrm? >> >> /usr/include/torque/pbs_ifl.h:**682: note: previous declaration >> of ?openrm? 
was here >> >> make[1]: *** [MPBSI.o] Error 1 >> >> make[1]: Leaving directory `/usr/local/src/maui-3.3.1/**src/moab' >> >> make: *** [all] Error 2 >> >> >> >> -- >> >> Steven DuChene >> >> >> >> ______________________________**_________________ >> >> torqueusers mailing list >> >> torqueusers at supercluster.org >> > >> >> >> http://www.supercluster.org/**mailman/listinfo/torqueusers >> >> >> >> >> >> >> >> >> > ______________________________**_________________ >> > torqueusers mailing list >> > torqueusers at supercluster.org >> > >> >> > http://www.supercluster.org/**mailman/listinfo/torqueusers >> > >> > >> > >> ______________________________**_________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> > >> >> http://www.supercluster.org/**mailman/listinfo/torqueusers >> >> >> >> >> -- >> David Beer | Software Engineer >> Adaptive Computing >> >> -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120316/ce47464e/attachment.html From jfarran at uci.edu Fri Mar 16 14:40:28 2012 From: jfarran at uci.edu (Joseph Farran) Date: Fri, 16 Mar 2012 13:40:28 -0700 Subject: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport In-Reply-To: References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> <4F639022.4060706@uci.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC79D@ORSMSX106.amr.corp.intel.com> <4F6399BB.3030005@uci.edu> <4F639F33.9090102@uci.edu> Message-ID: <4F63A53C.4030502@uci.edu> Thanks David. I need to setup a test bed and will try to reproduce the issue with more information to provide. Joseph David Beer wrote: > Joseph, > > Sorry, I'm unable to reproduce your problem with the information that > you provided. 
I was able to change properties on an existing node, add > a new node, and then change the properties on that node (as well as > np) all without crashing. > > David > > On Fri, Mar 16, 2012 at 2:14 PM, Joseph Farran > wrote: > > Sorry no. > I made age old mistake of trying to do this on a production system > instead of on a test bed, so I was hurrying things and did not get > details. > > I am using Maui 3.3.1 and I also recompiled Maui after Torque v4. > What I do know is that changing a node properties with qmgr > cause /opt/torque/sbin/pbs_server to die and when I tried to start > it again, I could not with the corrupted db. > > > David Beer wrote: > > > > On Fri, Mar 16, 2012 at 1:51 PM, Joseph Farran > > >> wrote: > > After compiling Torque 4.0.0, I started fresh with: > > /opt/torque/sbin/pbs_server -t create > > I then re-read my qmgr comfigs from my old Torque. I was also > able to > stop and start Torque v4 just fine. > > I then added a node, or made changes to it ( I forget ) and > Torque > died. When I tried to start it again, it said: > > > Can you clarify exactly what happened? Did you crash TORQUE? > Were you able to get a core file? > > > # service pbs_server start > Starting TORQUE Server: PBS_Server: > LOG_ERROR::svr_recov_xml, No > server tag found in the database file??? > PBS_Server: LOG_ERROR::recov_svr_attr, Unable to read server > database > /opt/torque/sbin/pbs_server: failed to get server attributes > > Fun day. > > DuChene, StevenX A wrote: > > Can you explain the steps you used to reproduce the > corrupt DB > problem? > > -- > > Steven DuChene > > > > -----Original Message----- > > From: torqueusers-bounces at supercluster.org > > > > [mailto:torqueusers-bounces at supercluster.org > > >] On Behalf Of > Joseph > Farran > > Sent: Friday, March 16, 2012 12:10 PM > > To: Torque Users Mailing List > > Subject: Re: [torqueusers] Torque-4.0 bug? 
conflicting > types for > get_svrport > > > > I too just finished testing / wasting several hours > trying to go > from > > Torque 2.4.9 to version 4 and version 4 is NOT ready for > prime time. > > > > There is also a nasty bug where it will corrupt the > Torque db. Did > > not have time to write the erros, but I was able to > repeat it. > > > > Don't try to upgrade on a production system! > > > > Joseph > > > > DuChene, StevenX A wrote: > > > >> I am attempting to compile Maui-3.3.1 on a system where I > recently installed Torque-4.0 rpms I built from the spec file > included with the Torque 4.0 sources. I am getting the > following > error out when the system attempts to build MPBSI.c > >> > >> gcc -I../../include/ -I/usr/local/maui/include > -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 > -c MPBSI.c > >> MPBSI.c:177: error: conflicting types for ?get_svrport? > >> /usr/include/torque/pbs_ifl.h:681: note: previous > declaration > of ?get_svrport? was here > >> MPBSI.c:178: error: conflicting types for ?openrm? > >> /usr/include/torque/pbs_ifl.h:682: note: previous > declaration > of ?openrm? 
was here > >> make[1]: *** [MPBSI.o] Error 1 > >> make[1]: Leaving directory > `/usr/local/src/maui-3.3.1/src/moab' > >> make: *** [all] Error 2 > >> > >> -- > >> Steven DuChene > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > > > > > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > >> > >> > >> > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > > > > -- > David Beer | Software Engineer > Adaptive Computing > From stevenx.a.duchene at intel.com Fri Mar 16 16:47:17 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Fri, 16 Mar 2012 22:47:17 +0000 Subject: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport In-Reply-To: References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu> <4514DE16-9BB6-4339-90BB-BC699BC427EB@nyu.edu> <560DBE57F33C4C4C9FBF11C662951AF805ABC77E@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABC82D@ORSMSX106.amr.corp.intel.com> Yes, I made modifications to the extern lines for get_svrport and openrm in src/moab/MPBSI.c so the defines corresponded with the entries in pbs_ifl.h After that maui built and seems to run fine. So will there be a note or notes added to the release notes for Torque-4.X ? -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Friday, March 16, 2012 1:02 PM To: Torque Users Mailing List Subject: Re: [torqueusers] Torque-4.0 bug? conflicting types for get_svrport Steven,? 
I didn't know that this bug went all the way back to Maui. It turns out that Moab (and Maui I suppose) both called functions not in the pbs_ifl.h file, and had incorrect declarations for these functions. Since you're doing this in Maui, you may need to update the source code to use the same declaration as is in the pbs_ifl.h file. I apologize for this inconvenience, we should've put this in the release notes but we just didn't know that this bug went all the way back to Maui.

David

On Fri, Mar 16, 2012 at 12:22 PM, DuChene, StevenX A wrote:

I am attempting to compile Maui-3.3.1 on a system where I recently installed Torque-4.0 rpms I built from the spec file included with the Torque 4.0 sources. I am getting the following error out when the system attempts to build MPBSI.c

gcc -I../../include/ -I/usr/local/maui/include -I/usr/include/torque -D__LINUX -D__MPBS -g -O2 -D__M64 -c MPBSI.c
MPBSI.c:177: error: conflicting types for 'get_svrport'
/usr/include/torque/pbs_ifl.h:681: note: previous declaration of 'get_svrport' was here
MPBSI.c:178: error: conflicting types for 'openrm'
/usr/include/torque/pbs_ifl.h:682: note: previous declaration of 'openrm' was here
make[1]: *** [MPBSI.o] Error 1
make[1]: Leaving directory `/usr/local/src/maui-3.3.1/src/moab'
make: *** [all] Error 2

--
Steven DuChene
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

--
David Beer | Software Engineer
Adaptive Computing

From stevenx.a.duchene at intel.com Fri Mar 16 18:45:55 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Sat, 17 Mar 2012 00:45:55 +0000
Subject: [torqueusers] maui and torque not communicating
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com>

I am just wondering if anyone actually has Torque-4.0 running and working with Maui as the scheduler?
I have Torque-4.0 compiled and running without the pbs_sched part installed. I have Maui-3.3.1 installed and running as well, but it really seems like the two systems are not talking to each other.

If I submit jobs with qsub from Torque I can see them sitting in the queue with qstat:

[root at elogin2 hwloc-1.4.1]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2.elogin2                 script.pbs       saducheX               0 Q batch

But if I then use showq (a Maui tool) the job does not show up.

[saducheX at elogin2 ~]$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

     0 Active Jobs       0 of 1024 Processors Active (0.00%)
                         0 of  256 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0

If I try to run mdiag -j 2 it returns nothing:

[root at elogin2 hwloc-1.4.1]# mdiag -j 2
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features

The checkjob util says:

[root at elogin2 hwloc-1.4.1]# checkjob 2
ERROR: 'checkjob' failed
ERROR: cannot locate job '2'

[saducheX at elogin2 ~]$ checkjob 2.elogin2
ERROR: 'checkjob' failed
ERROR: cannot locate job '2.elogin2'

So my basic question is: does anyone have Maui working with Torque-4.0? If so, what did you have to do to get things operational? Is there something I am missing?
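
One thing that may be worth ruling out first (this is an assumption, not something confirmed in this thread): the TORQUE 4.0 announcement says trqauthd must be running on any machine that runs client commands, and Maui talks to pbs_server much as a client does. A minimal sketch for checking that the relevant daemons are up on the scheduler host follows. The daemon names are the usual defaults and may differ on a given install, and check_daemon is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: report whether each daemon appears in the process table.
# Daemon names below are assumed defaults (trqauthd, pbs_server, maui).

check_daemon() {
    # pgrep -x matches the exact process name; its output is discarded.
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "$1: running"
    else
        echo "$1: not running"
    fi
}

for daemon in trqauthd pbs_server maui; do
    check_daemon "$daemon"
done

# If qmgr is installed, show the server's scheduling attribute (if set).
if command -v qmgr >/dev/null 2>&1; then
    qmgr -c 'print server' | grep -i 'scheduling' || true
fi
```

If trqauthd turns out not to be running, starting it and then restarting the scheduler would be a reasonable next experiment, in line with the announcement's advice for Moab.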
--
Steven DuChene

From stevenx.a.duchene at intel.com Fri Mar 16 18:50:32 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Sat, 17 Mar 2012 00:50:32 +0000
Subject: [torqueusers] maui and torque not communicating
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABC87F@ORSMSX106.amr.corp.intel.com>

BTW, in my maui.log file I am seeing the following:

03/16 17:48:21 MRMWorkloadQuery()
03/16 17:48:21 MPBSWorkloadQuery(ELOGIN2,JCount,SC)
03/16 17:48:21 INFO: queue is empty
03/16 17:48:21 INFO: 0 PBS jobs detected on RM ELOGIN2
03/16 17:48:21 WARNING: no workload detected

Even though qstat from torque returns:

[root at elogin2 log]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2.elogin2                 script.pbs       saducheX               0 Q batch

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A
Sent: Friday, March 16, 2012 5:46 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] maui and torque not communicating

I am just wondering if anyone actually has Torque-4.0 running and working with Maui as the scheduler? I have Torque-4.0 compiled and running without the pbs_sched part installed. I have Maui-3.3.1 installed and running as well, but it really seems like the two systems are not talking to each other.

If I submit jobs with qsub from Torque I can see them sitting in the queue with qstat:

[root at elogin2 hwloc-1.4.1]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2.elogin2                 script.pbs       saducheX               0 Q batch

But if I then use showq (a Maui tool) the job does not show up.
[saducheX at elogin2 ~]$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

     0 Active Jobs       0 of 1024 Processors Active (0.00%)
                         0 of  256 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0

If I try to run mdiag -j 2 it returns nothing:

[root at elogin2 hwloc-1.4.1]# mdiag -j 2
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features

The checkjob util says:

[root at elogin2 hwloc-1.4.1]# checkjob 2
ERROR: 'checkjob' failed
ERROR: cannot locate job '2'

[saducheX at elogin2 ~]$ checkjob 2.elogin2
ERROR: 'checkjob' failed
ERROR: cannot locate job '2.elogin2'

So my basic question is: does anyone have Maui working with Torque-4.0? If so, what did you have to do to get things operational? Is there something I am missing?

--
Steven DuChene
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From stevenx.a.duchene at intel.com Fri Mar 16 20:26:53 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Sat, 17 Mar 2012 02:26:53 +0000
Subject: [torqueusers] TORQUE 4.0 Officially Announced
In-Reply-To:
References:
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com>

It is unclear from this announcement text where hwloc has to be installed. Is it just on the server or on the nodes only?

I looked in the various README files and the Release_Notes file packaged with the sources and there is no mention of hwloc in those at all. There is only the one short mention in the CHANGELOG file that is even less than what is in the announcement below.

More documentation about this would be greatly appreciated.
-- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Tuesday, March 13, 2012 12:43 PM To: Torque Users Mailing List; Torque Developers mailing list Subject: [torqueusers] TORQUE 4.0 Officially Announced All, TORQUE 4.0 is officially here! Please check out Adaptive Computing's official announcement here: http://www.adaptivecomputing.com/adaptive-computing-offers-the-next-generation-of-high-performance-computing-with-moab-hpc-suite-7-0/ The tarball can be downloaded from here: http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.0.tar.gz We have several sites currently using 4.0 and feedback has been positive. These warnings are posted on the download site, but I am copying them here: 1. Make sure that you have openssl-devel (RedHat based) / libssl-dev (Debian based) installed (the name may differ for different operating systems) in order to be able to build TORQUE 4.0. 2. Make sure that you run the daemon trqauthd on machines that will be running client commands. NOTE: there is an init.d script for it in contrib/init.d/ but it needs customization (this includes Moab). One problem is that it has a misspelling for PBS_DAEMON - it should be /usr/local/sbin/trqauthd by default, not /usr/local/bin/trqauthd. 3. Moab needs to be started or restarted after installing TORQUE 4.0 (if you are using Moab) Please make sure to take all normal precautions for upgrading. Another advisory (not on the website) is that TORQUE now uses hwloc to manage cpusets, meaning you will need to install hwloc on your system if it isn't already there and you wish to use it. It needs to be version 1.1 or higher. The major features of the release are briefly described on the release, but the CHANGELOG for 4.0 is copied at the end of this email. This release has undergone more testing than any previous release of TORQUE; to be fair, it also has more changes than any previous version of TORQUE. 
Overall, we saw very good results in our beta program and most of the sites using it have had good experiences. We are proud of the quality of this release and hope that you'll try it out and let us know how it works for you.

--
David Beer | Software Engineer
Adaptive Computing

4.0.0
  e - make a threadpool for TORQUE server. The number of threads is
      customizable using min_threads and max_threads, and idle time before
      exiting can be set using thread_idle_seconds.
  e - make pbs_server multi-threaded in order to increase responsiveness and scalability.
  e - remove the forking from pbs_server running a job, the thread handling the request just
      waits until the job is run.
  e - change qdel to simply send qdel all - previously this was executed by a qstat and a qdel
      of every individual job
  e - no longer fork to send mail, just use a thread
  e - use hwloc as the backbone for cpuset support in TORQUE (contributed by Dr. Bernd Kallies)
  e - add the boolean variable $use_smt to mom config. If set to false, this skips logical
      cores and uses only physical cores for the job. It is true by default.
      (contributed by Dr. Bernd Kallies)
  n - with the multi-threading the pbs_server -t create and -t cold commands could no longer
      ask for user input from the command line. The call to ask if the user wants to continue
      was moved higher in the initialization process and some of the wording changed to
      reflect what is now happening.
  e - if cpusets are configured but aren't found and cannot be mounted, pbs_mom will now fail to
      start instead of failing silently.
  e - Change node_spec from an N^2 (but average 5N) algorithm to an N algorithm with respect
      to nodes. We only loop over each node once at a maximum.
  e - Abandon pbs_iff in favor of trqauthd. trqauthd is a daemon to be started once that can
      perform pbs_iff's functionality, increasing speed and enabling future security
      enhancements
  e - add mom_hierarchy functionality for reporting.
      The file is located in /server_priv/mom_hierarchy, and can be written to tell moms to send
      updates to other moms who will pass them on to pbs_server. See docs for details
  e - add a unit testing framework (check). It is compiled with --with-check and tests
      are executed using make check. The framework is complete but not many tests have
      been written as of yet.
  e - Mom rejection messages are now passed back to qrun when possible
  e - Added the option -c for startup. By default, the server attempts to send the mom
      hierarchy file to all moms on startup, and all moms update the server and request
      the hierarchy file. If both are trying to do this at once, it can cause a lot of
      traffic. -c tells pbs_server to wait 10 minutes to attempt to contact moms that
      haven't contacted it, reducing this traffic.
  e - Added mom parameter -w to reduce start times. This parameter waits to send its
      first update until the server sends it the mom hierarchy file, or until 10
      minutes have passed. This should reduce large cluster startup times.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120317/3dbbc4a8/attachment-0001.html

From jfarran at uci.edu Sat Mar 17 11:50:39 2012
From: jfarran at uci.edu (Joseph A. Farran)
Date: Sat, 17 Mar 2012 10:50:39 -0700
Subject: [torqueusers] Email Notification without #PBS -M
Message-ID: <4F64CEEF.1060407@uci.edu>

Hello.

I am using Torque 2.5.10. If I am reading the Torque manual correctly, one can have:

#PBS -m abe

without the "#PBS -M email at address". On page 135 of the Torque admin, when talking about "#PBS -M", it says:

"If unset, the list defaults to the submitting user at the qsub host, i.e. the job owner."

How can I have "#PBS -m abe" without including "#PBS -M me at mysite"?

I have several pbs examples that our users use and I want an automatic "#PBS -M $USER"

By the way, "#PBS -M " works but not "#PBS -M $USER".
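
A likely reason "#PBS -M $USER" does not work is that qsub reads #PBS lines as directives and does not pass them through the shell, so $USER is never expanded. A minimal sketch of one workaround is to put the options on the qsub command line, where the shell does expand the variable. The helper name mail_args and the script name job.pbs are made up for illustration, and the sketch only prints the command instead of submitting anything.

```shell
#!/bin/sh
# Sketch only: build the mail options for qsub from the submitting user.
# mail_args is a hypothetical helper; job.pbs is a placeholder script name.

mail_args() {
    # $1 = user name, $2 = optional mail domain (may be empty)
    if [ -n "$2" ]; then
        printf '%s\n' "-m abe -M $1@$2"
    else
        printf '%s\n' "-m abe -M $1"
    fi
}

# Fall back to id -un when USER is not set in the environment.
submit_user="${USER:-$(id -un)}"

# Print the command that would be run, rather than running it.
echo "qsub $(mail_args "$submit_user" "") job.pbs"
```

Options given on the qsub command line take precedence over the same options in the script, so the example scripts themselves can stay unchanged.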
Joseph

From jfarran at uci.edu Sat Mar 17 11:56:26 2012
From: jfarran at uci.edu (Joseph A. Farran)
Date: Sat, 17 Mar 2012 10:56:26 -0700
Subject: [torqueusers] Email Notification without #PBS -M
In-Reply-To: <4F64CEEF.1060407@uci.edu>
References: <4F64CEEF.1060407@uci.edu>
Message-ID: <4F64D04A.9020007@uci.edu>

An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120317/22bc0540/attachment.html

From akohlmey at cmm.chem.upenn.edu Sat Mar 17 12:05:17 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Sat, 17 Mar 2012 14:05:17 -0400
Subject: [torqueusers] Email Notification without #PBS -M
In-Reply-To: <4F64D04A.9020007@uci.edu>
References: <4F64CEEF.1060407@uci.edu> <4F64D04A.9020007@uci.edu>
Message-ID:

On Sat, Mar 17, 2012 at 1:56 PM, Joseph A. Farran wrote:
> I forgot to say that simply having:
>
>     #PBS -m abe
>
> without "#PBS -M " will not send email.
>
>
> On 3/17/2012 10:50 AM, Joseph A. Farran wrote:
>
> Hello.
>
> I am using Torque 2.5.10 If I am reading the Torque manual correctly, one
> can have:
>
> #PBS -m abe
>
> without the "#PBS -M email at address". On page 135 of the Torque admin when
> talking about "#PBS -M"it says:
>
> "If unset, the list defaults to the submitting user at the qsub host, i.e.
> the job owner."
>
> How can I have "#PBS -m abe" without including "#PBS -M me at mysite"?
>
> I have several pbs examples that our users use and I want an automatic "#PBS
> -M $USER"
>
> By the way, "#PBS -M " works but not "#PBS -M $USER".

it works for me here. but i had to make an effort so
that local mail submission on the cluster forwards
e-mails to a canonical email address. otherwise
the emails will just sit in your mail spool forever.
which is the better way, depends on your preferences
and the MTA you want to use.

in short, it is an issue of configuring the local
mail setup, not so much of the torque setup.

axel.
>
> Joseph
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From jfarran at uci.edu Sat Mar 17 12:26:36 2012
From: jfarran at uci.edu (Joseph A. Farran)
Date: Sat, 17 Mar 2012 11:26:36 -0700
Subject: [torqueusers] Email Notification without #PBS -M
In-Reply-To:
References: <4F64CEEF.1060407@uci.edu> <4F64D04A.9020007@uci.edu>
Message-ID: <4F64D75C.3000808@uci.edu>

Thanks Axel. You are correct.

For me, I also forgot about "set server mail_domain = " which had to be set to the submitting host. Setting this to the submission host did the trick.

Best,
Joseph

On 3/17/2012 11:05 AM, Axel Kohlmeyer wrote:
> On Sat, Mar 17, 2012 at 1:56 PM, Joseph A. Farran wrote:
>> I forgot to say that simply having:
>>
>> #PBS -m abe
>>
>> without "#PBS -M " will not send email.
>>
>>
>> On 3/17/2012 10:50 AM, Joseph A. Farran wrote:
>>
>> Hello.
>>
>> I am using Torque 2.5.10 If I am reading the Torque manual correctly, one
>> can have:
>>
>> #PBS -m abe
>>
>> without the "#PBS -M email at address". On page 135 of the Torque admin when
>> talking about "#PBS -M"it says:
>>
>> "If unset, the list defaults to the submitting user at the qsub host, i.e.
>> the job owner."
>>
>> How can I have "#PBS -m abe" without including "#PBS -M me at mysite"?
>>
>> I have several pbs examples that our users use and I want an automatic "#PBS
>> -M $USER"
>>
>> By the way, "#PBS -M" works but not "#PBS -M $USER".
> it works for me here.
but i had to make an effort so > that local mail submission on the cluster forwards > e-mails to a canonical email address. otherwise > the emails will just sit in your mail spool forever. > which is the better way, depends on your preferences > and the MTA you want to use. > > in short, it is an issue of configuring the local > mail setup, not so much of the torque setup. > > axel. > > >> Joseph >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > From cwest at vpac.org Sun Mar 18 23:40:14 2012 From: cwest at vpac.org (Craig West) Date: Mon, 19 Mar 2012 16:40:14 +1100 Subject: [torqueusers] TORQUE 4.0 Officially Announced In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> Message-ID: <4F66C6BE.3010605@vpac.org> Hi Steven, I have just begun testing Torque 4.0, as hwloc has been a long awaited feature for me. > It is unclear from this announcement text where hwloc has to be installed. > Is it just on the server or on the nodes only? It needs to be available on the BUILD server and the nodes. I tried to run pbs_mom on a node without the hwloc I had installed and it failed. Note: I am running hwloc 1.4 from a directory in /usr/local This was not automatically found by the TORQUE configure script, but you can specify the location using HWLOC_CFLAGS & HWLOC_LIBS. It embeds the locations that you specify in the pbs_mom (and other files) but it seems you can set the LD_LIBRARY_PATH variable if it is not in the same location on the BUILD server as the compute nodes. For simplicity installing them in the same location makes sense. 
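
As a rough sketch of the build step described above, assuming hwloc lives under a non-standard prefix and that cpuset support is requested with --enable-cpuset (both assumptions; the prefix /usr/local/hwloc-1.4 and the helper names are hypothetical), the configure invocation and the runtime library path might look like this. The script only prints the commands so they can be reviewed before use.

```shell
#!/bin/sh
# Sketch only: print a TORQUE configure line that points at a custom hwloc.
# HWLOC_PREFIX is a hypothetical install location; adjust to the real one.
HWLOC_PREFIX="${HWLOC_PREFIX:-/usr/local/hwloc-1.4}"

hwloc_configure_cmd() {
    # HWLOC_CFLAGS and HWLOC_LIBS are the variables named in the message above.
    printf './configure --enable-cpuset HWLOC_CFLAGS="-I%s/include" HWLOC_LIBS="-L%s/lib -lhwloc"\n' "$1" "$1"
}

hwloc_runtime_env() {
    # For nodes where hwloc sits in a different place than on the build host.
    printf 'export LD_LIBRARY_PATH=%s/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}\n' "$1"
}

hwloc_configure_cmd "$HWLOC_PREFIX"
hwloc_runtime_env "$HWLOC_PREFIX"
```

Installing hwloc in the same location on the build host and the compute nodes, as suggested above, would make the second line unnecessary.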
> More documentation about this would be greatly appreciated. I agree, clearer and more detailed documentation would be useful. Cheers, Craig. From stevenx.a.duchene at intel.com Mon Mar 19 08:47:28 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Mon, 19 Mar 2012 14:47:28 +0000 Subject: [torqueusers] TORQUE 4.0 Officially Announced In-Reply-To: <4F66C6BE.3010605@vpac.org> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> Also a better (more complete) explanation of what features are enabled when hwloc is used would be helpful as well. BTW, I built torque on my server without hwloc installed and then installed the resulting mom packages on my nodes. The mom daemons in that case did seem to start up just fine. -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Craig West Sent: Sunday, March 18, 2012 10:40 PM To: Torque Users mailing list; Torque Developers mailing list Subject: Re: [torqueusers] TORQUE 4.0 Officially Announced Hi Steven, I have just begun testing Torque 4.0, as hwloc has been a long awaited feature for me. > It is unclear from this announcement text where hwloc has to be installed. > Is it just on the server or on the nodes only? It needs to be available on the BUILD server and the nodes. I tried to run pbs_mom on a node without the hwloc I had installed and it failed. Note: I am running hwloc 1.4 from a directory in /usr/local This was not automatically found by the TORQUE configure script, but you can specify the location using HWLOC_CFLAGS & HWLOC_LIBS. It embeds the locations that you specify in the pbs_mom (and other files) but it seems you can set the LD_LIBRARY_PATH variable if it is not in the same location on the BUILD server as the compute nodes. 
For simplicity installing them in the same location makes sense. > More documentation about this would be greatly appreciated. I agree, clearer and more detailed documentation would be useful. Cheers, Craig. _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From knielson at adaptivecomputing.com Mon Mar 19 09:26:40 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 19 Mar 2012 09:26:40 -0600 Subject: [torqueusers] TORQUE 4.0 Officially Announced In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> Message-ID: On Fri, Mar 16, 2012 at 8:26 PM, DuChene, StevenX A < stevenx.a.duchene at intel.com> wrote: > It is unclear from this announcement text where hwloc has to be > installed.**** > > Is it just on the server or on the nodes only?**** > > I looked in the various README files and the Release_Notes file packages > with the sources and there is no mention of hwloc in those at all. There is > only the one short mention in the CHANGELOG file that is even less than > what is in the announcement below.**** > > ** ** > > More documentation about this would be greatly appreciated.**** > > --**** > > Steven DuChene > Steve, Consider it done. It will be part of 4.0.1. Ken > **** > > ** ** > > *From:* torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] *On Behalf Of *David Beer > *Sent:* Tuesday, March 13, 2012 12:43 PM > *To:* Torque Users Mailing List; Torque Developers mailing list > *Subject:* [torqueusers] TORQUE 4.0 Officially Announced**** > > ** ** > > All,**** > > ** ** > > TORQUE 4.0 is officially here! 
Please check out Adaptive Computing's
> official announcement here:
> http://www.adaptivecomputing.com/adaptive-computing-offers-the-next-generation-of-high-performance-computing-with-moab-hpc-suite-7-0/
>
> The tarball can be downloaded from here:
> http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.0.tar.gz
>
> We have several sites currently using 4.0 and feedback has been positive.
> These warnings are posted on the download site, but I am copying them here:
>
> 1. Make sure that you have openssl-devel (RedHat based) / libssl-dev
> (Debian based) installed (the name may differ for different operating
> systems) in order to be able to build TORQUE 4.0.
>
> 2. Make sure that you run the daemon trqauthd on machines that will be
> running client commands. NOTE: there is an init.d script for it in
> contrib/init.d/ but it needs customization (this includes Moab). One
> problem is that it has a misspelling for PBS_DAEMON - it should be
> /usr/local/sbin/trqauthd by default, not /usr/local/bin/trqauthd.
>
> 3. Moab needs to be started or restarted after installing TORQUE 4.0 (if
> you are using Moab)
>
> Please make sure to take all normal precautions for upgrading. Another
> advisory (not on the website) is that TORQUE now uses hwloc to manage
> cpusets, meaning you will need to install hwloc on your system if it isn't
> already there and you wish to use it. It needs to be version 1.1 or higher.
>
> The major features of the release are briefly described on the release,
> but the CHANGELOG for 4.0 is copied at the end of this email.
>
> This release has undergone more testing than any previous release of
> TORQUE; to be fair, it also has more changes than any previous version of
> TORQUE. Overall, we saw very good results in our beta program and most of
> the sites using it have had good experiences.
We are proud of the quality > of this release and hope that you'll try it out and let us know how it > works for you.**** > > ** ** > > -- **** > > David Beer | Software Engineer**** > > Adaptive Computing**** > > ** ** > > ** ** > > 4.0.0**** > > e - make a threadpool for TORQUE server. The number of threads is**** > > customizable using min_threads and max_threads, and idle time before > **** > > exiting can be set using thread_idle_seconds.**** > > e - make pbs_server multi-threaded in order to increase responsiveness > and scalability.**** > > e - remove the forking from pbs_server running a job, the thread > handling the request just**** > > waits until the job is run.**** > > e - change qdel to simply send qdel all - previously this was executed > by a qstat and a qdel**** > > of every individual job**** > > e - no longer fork to send mail, just use a thread**** > > e - use hwloc as the backbone for cpuset support in TORQUE (contributed > by Dr. Bernd Kallies)**** > > e - add the boolean variable $use_smt to mom config. If set to false, > this skips logical**** > > cores and uses only physical cores for the job. It is true by > default.**** > > (contributed by Dr. Bernd Kallies)**** > > n - with the multi-threading the pbs_server -t create and -t cold > commands could no longer**** > > ask for user input from the command line. The call to ask if the > user wants to continue**** > > was moved higher in the initialization process and some of the > wording changed to**** > > reflect what is now happening.**** > > e - if cpusets are configured but aren't found and cannot be mounted, > pbs_mom will now fail to**** > > start instead of failing silently.**** > > e - Change node_spec from an N^2 (but average 5N) algorithm to an N > algorithm with respect**** > > to nodes. We only loop over each node once at a maximum.**** > > e - Abandon pbs_iff in favor of trqauthd. 
trqauthd is a daemon to be > started once that can**** > > perform pbs_iff's functionality, increasing speed and enabling > future security**** > > enhancements**** > > e - add mom_hierarchy functionality for reporting. The file is located in > **** > > /server_priv/mom_hierarchy, and can be written to tell > moms to send**** > > updates to other moms who will pass them on to pbs_server. See docs > for details**** > > e - add a unit testing framework (check). It is compiled with > --with-check and tests**** > > are executed using make check. The framework is complete but not > many tests have**** > > been written as of yet.**** > > e - Mom rejection messages are now passed back to qrun when possible**** > > e - Added the option -c for startup. By default, the server attempts to > send the mom**** > > hierarchy file to all moms on startup, and all moms update the > server and request**** > > the hierarchy file. If both are trying to do this at once, it can > cause a lot of**** > > traffic. -c tells pbs_server to wait 10 minutes to attempt to > contact moms that**** > > haven't contacted it, reducing this traffic.**** > > e - Added mom parameter -w to reduce start times. This parameter wait to > send it's**** > > first update until the server sends it the mom hierarchy file, or > until 10**** > > minutes have passed. This should reduce large cluster startup times. > **** > > ** ** > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/77b2cb60/attachment-0001.html

From knielson at adaptivecomputing.com Mon Mar 19 09:41:57 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 19 Mar 2012 09:41:57 -0600
Subject: [torqueusers] maui and torque not communicating
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com>
Message-ID:

On Fri, Mar 16, 2012 at 6:45 PM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:

> I am just wondering if anyone actually has Torque-4.0 running and working
> with Maui as the scheduler?
>
> I have Torque-4.0 compiled and running without the pbs_sched part
> installed.
>
> I have Maui-3.3.1 installed and running as well but it really seems like
> the two systems are not really talking to each other.
>
> If I submit jobs with qsub from Torque I can see them sitting in the queue
> with qstat:
>
> [root at elogin2 hwloc-1.4.1]# qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 2.elogin2                 script.pbs       saducheX               0 Q batch
>
> But if I then use showq (a Maui tool) the job does not show up.
>
> [saducheX at elogin2 ~]$ showq
> ACTIVE JOBS--------------------
> JOBNAME    USERNAME    STATE    PROC    REMAINING    STARTTIME
>
>      0 Active Jobs       0 of 1024 Processors Active (0.00%)
>                          0 of  256 Nodes Active      (0.00%)
>
> IDLE JOBS----------------------
> JOBNAME    USERNAME    STATE    PROC    WCLIMIT    QUEUETIME
>
> 0 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME    USERNAME    STATE    PROC    WCLIMIT    QUEUETIME
>
> Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0
>
> If I try to run mdiag -j 2 it returns nothing:
>
> [root at elogin2 hwloc-1.4.1]# mdiag -j 2
> Name State Par Proc QOS WCLimit R Min User
> Group Account QueuedTime Network Opsys Arch Mem Disk Procs
> Class Features
>
> The checkjob util says:
>
> [root at elogin2 hwloc-1.4.1]# checkjob 2
> ERROR: 'checkjob' failed
> ERROR: cannot locate job '2'
>
> [saducheX at elogin2 ~]$ checkjob 2.elogin2
> ERROR: 'checkjob' failed
> ERROR: cannot locate job '2.elogin2'
>
> So my basic question is does anyone have maui working with Torque-4.0?
> If so what did you have to do to get things operational?
> Is there something I am missing?
> --
> Steven DuChene

Steve,

What version of Linux are you running?

Ken
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/88fabdff/attachment.html

From stevenx.a.duchene at intel.com Mon Mar 19 09:51:13 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 19 Mar 2012 15:51:13 +0000
Subject: [torqueusers] maui and torque not communicating
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com>

RHEL6.1 with latest standard kernel.
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, March 19, 2012 8:42 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating On Fri, Mar 16, 2012 at 6:45 PM, DuChene, StevenX A > wrote: I am just wondering if anyone actually has Torque-4.0 running and working with Maui as the scheduler? I have Torque-4.0 compiled and running without the pbs_sched part installed. I have Maui-3.3.1 installed and running as well but it really seems like the two systems are not really talking to each other. If I submit jobs with qsub from Torque I can see them sitting in the queue with qstat: [root at elogin2 hwloc-1.4.1]# qstat Job id Name User Time Use S Queue ------------------------- ---------------- --------------- -------- - ----- 2.elogin2 script.pbs saducheX 0 Q batch But if I then use showq (a moui tool) the job does not show up. [saducheX at elogin2 ~]$ showq ACTIVE JOBS-------------------- JOBNAME USERNAME STATE PROC REMAINING STARTTIME 0 Active Jobs 0 of 1024 Processors Active (0.00%) 0 of 256 Nodes Active (0.00%) IDLE JOBS---------------------- JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME 0 Idle Jobs BLOCKED JOBS---------------- JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME Total Jobs: 0 Active Jobs: 0 Idle Jobs: 0 Blocked Jobs: 0 If I try to run mdiag -j 2 it returns nothing: [root at elogin2 hwloc-1.4.1]# mdiag -j 2 Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features The checkjob util says: [root at elogin2 hwloc-1.4.1]# checkjob 2 ERROR: 'checkjob' failed ERROR: cannot locate job '2' [saducheX at elogin2 ~]$ checkjob 2.elogin2 ERROR: 'checkjob' failed ERROR: cannot locate job '2.elogin2' So my basic question is does anyone have maui working with Torque-4.0? If so what did you have to do to get things operational? Is there something I am missing? 
--
Steven DuChene

Steve,

What version of Linux are you running?

Ken
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/c34523e9/attachment.html

From dbeer at adaptivecomputing.com Mon Mar 19 09:54:04 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 19 Mar 2012 09:54:04 -0600
Subject: [torqueusers] TORQUE 4.0 Officially Announced
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com>
Message-ID:

Steve,

Hwloc is now required for running cpusets in TORQUE, and it helps out a lot both in immediate use and in groundwork for future features.

Immediately, hwloc gives you a better cpuset because it gives you the next core instead of the next indexed core. For example: many eight core systems have processors 0, 2, 4, and 6 next to each other and processors 1, 3, 5, and 7 next to each other. If you're running a pre-4.0 TORQUE, and you have two jobs on the node, each with 4 cores, job 1 will have 0-3 and job 2 will have 4-7. In TORQUE 4.0, job 1 will have 0, 2, 4, and 6, and job 2 will have 1, 3, 5, and 7. This should help speed up processing times for jobs (NOTE: only if you have this kind of system and a comparable job layout, I'm not promising a general speed-up to everyone using cpusets). This should also allow us to properly handle hyperthreading for anyone that has it turned on and wishes to use it.

The last immediate feature is if you have SMT (simultaneous multi-threading) hardware. The mom config variable $use_smt was added. By default, the use of SMT is enabled, but you can tell your pbs_mom to ignore the logical cores (not place them in the cpuset) by adding

$use_smt false

to your mom config file.

For the future, the hwloc threads make it really easy for us to handle hardware specific requests. One of the coming features for TORQUE is to allow requests roughly similar to:

socket=2:numa=2 --with-hyperthreads

which would say to spread the job over 2 sockets, and across the 2 numa nodes on each socket. This is a feature we plan to add to improve support for Magny-Cours and Opteron type processors that have multiple sockets and/or multiple numa nodes on the processor chip. Using hwloc makes it so we don't have to parse system files and map the indices to the sockets and/or numa nodes ourselves; we can simply use easy hwloc functions like hwloc_get_next_obj_inside_cpuset_by_type() that allow you to just move on to the next physical core or virtual core, or skip to the next socket or numa node as the case may be.

David

On Mon, Mar 19, 2012 at 8:47 AM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:

> Also a better (more complete) explanation of what features are enabled
> when hwloc is used would be helpful as well.
>
> BTW, I built torque on my server without hwloc installed and then
> installed the resulting mom packages on my nodes. The mom daemons in that
> case did seem to start up just fine.
> --
> Steven DuChene
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Craig West
> Sent: Sunday, March 18, 2012 10:40 PM
> To: Torque Users mailing list; Torque Developers mailing list
> Subject: Re: [torqueusers] TORQUE 4.0 Officially Announced
>
>
> Hi Steven,
>
> I have just begun testing Torque 4.0, as hwloc has been a long awaited
> feature for me.
>
> > It is unclear from this announcement text where hwloc has to be installed.
> > Is it just on the server or on the nodes only?
> > It needs to be available on the BUILD server and the nodes. I tried to > run pbs_mom on a node without the hwloc I had installed and it failed. > > Note: I am running hwloc 1.4 from a directory in /usr/local > This was not automatically found by the TORQUE configure script, but you > can specify the location using HWLOC_CFLAGS & HWLOC_LIBS. > It embeds the locations that you specify in the pbs_mom (and other > files) but it seems you can set the LD_LIBRARY_PATH variable if it is > not in the same location on the BUILD server as the compute nodes. > For simplicity installing them in the same location makes sense. > > > More documentation about this would be greatly appreciated. > > I agree, clearer and more detailed documentation would be useful. > > Cheers, > Craig. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/ad5d6a7a/attachment-0001.html From knielson at adaptivecomputing.com Mon Mar 19 10:29:37 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 19 Mar 2012 10:29:37 -0600 Subject: [torqueusers] maui and torque not communicating In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> Message-ID: On Mon, Mar 19, 2012 at 9:51 AM, DuChene, StevenX A < stevenx.a.duchene at intel.com> wrote: > RHEL6.1 with latest standard kernel.**** > > > Steve, We have found a problem with CentOS6 where getaddrinfo returns localhost.localdomain instead of a hostname internally in TORQUE. We currently try to authorize connections using only localhost. I am making a fix for this to 4.0.1. I could send you a patch for the 4.0 code if you want. Of course I am guessing that is the problem, but you would know with the patch if it is. Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/b1de2b22/attachment.html From stevenx.a.duchene at intel.com Mon Mar 19 10:53:04 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Mon, 19 Mar 2012 16:53:04 +0000 Subject: [torqueusers] maui and torque not communicating In-Reply-To: References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABD0F9@ORSMSX106.amr.corp.intel.com> Ken: Thanks for the offer of a patch. Do the symptoms you see with this match what I reported? Like I indicated it seems the communication between maui & torque is not completely functional. 
According to my maui log files some communication is happening but not anything even close to enough for the whole system to work as intended. -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, March 19, 2012 9:30 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating On Mon, Mar 19, 2012 at 9:51 AM, DuChene, StevenX A > wrote: RHEL6.1 with latest standard kernel. Steve, We have found a problem with CentOS6 where getaddrinfo returns localhost.localdomain instead of a hostname internally in TORQUE. We currently try to authorize connections using only localhost. I am making a fix for this to 4.0.1. I could send you a patch for the 4.0 code if you want. Of course I am guessing that is the problem, but you would know with the patch if it is. Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/d43fbfc2/attachment.html From knielson at adaptivecomputing.com Mon Mar 19 11:00:35 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 19 Mar 2012 11:00:35 -0600 Subject: [torqueusers] maui and torque not communicating In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABD0F9@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD0F9@ORSMSX106.amr.corp.intel.com> Message-ID: On Mon, Mar 19, 2012 at 10:53 AM, DuChene, StevenX A < stevenx.a.duchene at intel.com> wrote: > Ken:**** > > Thanks for the offer of a patch. 
Do the symptoms you see with this match > what I reported?**** > > Like I indicated it seems the communication between maui & torque is not > completely functional.**** > > According to my maui log files some communication is happening but not > anything even close to enough for the whole system to work as intended.*** > * > > --**** > > Steven DuChene > Steven, I cannot say that they match. But when Maui communicates with TORQUE there is some name and host verification done. Ken > **** > > ** ** > > *From:* torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] *On Behalf Of *Ken Nielson > *Sent:* Monday, March 19, 2012 9:30 AM > > *To:* Torque Users Mailing List > *Subject:* Re: [torqueusers] maui and torque not communicating**** > > ** ** > > ** ** > > On Mon, Mar 19, 2012 at 9:51 AM, DuChene, StevenX A < > stevenx.a.duchene at intel.com> wrote:**** > > RHEL6.1 with latest standard kernel.**** > > ** ** > > Steve, > > We have found a problem with CentOS6 where getaddrinfo returns > localhost.localdomain instead of a hostname internally in TORQUE. We > currently try to authorize connections using only localhost. I am making a > fix for this to 4.0.1. I could send you a patch for the 4.0 code if you > want. > > Of course I am guessing that is the problem, but you would know with the > patch if it is. > > Ken**** > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/dcc638a3/attachment.html

From nima.irt at gmail.com Mon Mar 19 10:56:12 2012
From: nima.irt at gmail.com (Nima Mohammadi)
Date: Mon, 19 Mar 2012 20:26:12 +0330
Subject: [torqueusers] Only a fraction of jobs are being run
Message-ID:

Hi folks,

It's new year's eve (Nowruz) in my country and apparently I've got the entire cluster to myself during holidays. So now that everyone is celebrating, I was going to submit my job array of hundreds of jobs to the queue. But unfortunately only a fraction of them run simultaneously, and others get into queue waiting for running jobs to get completed.

[mohammadi at server mohammadi]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1845.server               ME3550.10.1.job  mohammadi       00:04:19 C batch
1847.server               ME3550.10.3.job  mohammadi       00:04:18 C batch
1848.server               ME3550.20.1.job  mohammadi       00:04:19 C batch
1850.server               ME3550.20.3.job  mohammadi       00:04:20 C batch
1851.server               ME3550.30.1.job  mohammadi              0 R batch
1852.server               ME3550.30.2.job  mohammadi              0 Q batch
1853.server               ME3550.30.3.job  mohammadi              0 Q batch
1854.server               ME3560.10.1.job  mohammadi              0 Q batch
....
1961.server               ME3770.30.3.job  mohammadi              0 R batch
1962.server               ME3780.10.1.job  mohammadi              0 Q batch
1963.server               ME3780.10.2.job  mohammadi              0 R batch
1964.server               ME3780.10.3.job  mohammadi              0 Q batch
....

At first, I got suspicious that a slot limit is set, but checking the configurations with qmgr showed no max_slot_limit:

Qmgr: list server
Server server
    server_state = Active
    scheduling = True
    total_jobs = 213
    state_count = Transit:0 Queued:203 Held:0 Waiting:0 Running:10 Exiting:0
    acl_hosts = server
    acl_roots = root@*
    managers = cartoonist at server,ghader at server,mohammadi at server,
               root at server
    operators = cartoonist at server,ghader at server,mohammadi at server,
                root at server,seyedi at server
    default_queue = batch
    log_events = 511
    mail_from = adm
    resources_assigned.mem = 0b
    resources_assigned.nodect = 10
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    mom_job_sync = True
    pbs_version = 3.0.1
    keep_completed = 10
    next_job_number = 2070
    net_counter = 2 1 0
    record_job_info = True
    job_log_file_max_size = 10000
    job_log_file_roll_depth = 5
    job_log_keep_days = 10

Queue batch
    queue_type = Execution
    total_jobs = 213
    state_count = Transit:0 Queued:202 Held:0 Waiting:0 Running:10 Exiting:0
    resources_max.cput = 20000:00:00
    resources_min.cput = 00:00:01
    resources_default.cput = 10000:00:00
    resources_default.nodes = 2
    resources_default.walltime = 100:00:00
    mtime = Mon Mar 19 18:53:51 2012
    resources_assigned.mem = 0b
    resources_assigned.nodect = 10
    enabled = True
    started = True

Checking with the pbsnodes command, there are 14 nodes up and running with 246 processors. My problem is an 'embarrassingly parallel' workload and there's no dependency among the jobs. The batch scripts are generated using the Python script below:

from subprocess import call
import os

script='''
#!/bin/sh
#PBS -l nodes=1:ppn=1,walltime=00:06:00
#PBS -o /dev/null
#PBS -e /dev/null
#PBS -q batch
#PBS -M nima.irt at gmail.com
#PBS -m abe
source /share/mohammadi/nima/mydevenv/bin/activate
cd /share/mohammadi/nima/AI/
python ME-cluster.py %d %d %d %2.1f %2.1f
'''

total_experts = 3
for gating_hidden in xrange(5, 10):
    for experts_hidden in xrange(5, 10):
        for gating_N in [x * 0.1 for x in range(1, 4)]:
            for experts_N in [x * 0.1 for x in range(1, 4)]:
                script_name = '/tmp/ME%d%d%d%2.1f%2.1f.job' % (total_experts,
                    gating_hidden, experts_hidden, gating_N, experts_N)
                with open(script_name,'w') as scriptf:
                    scriptf.write(script % (total_experts, gating_hidden,
                        experts_hidden, gating_N, experts_N))
                call(["qsub", script_name])

Any help would be highly appreciated :)

-- Nima Mohammadi

From stevenx.a.duchene at intel.com Mon Mar 19 11:24:33 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 19 Mar 2012 17:24:33 +0000
Subject: [torqueusers] maui and torque not communicating
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD0F9@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABD12F@ORSMSX106.amr.corp.intel.com>

Would there be any symptoms in either the torque or maui log files that I can look for that would match any of the issues that this patch would address? I don't recall seeing anything about host verification errors in either place.

I am seeing this sort of error in the torque server log files:

03/19/2012 10:11:06;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::process_pbs_server_port, Socket (10) close detected from 36.101.8.27:15004

Is that a pointer that would reference this problem?
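
The server log record quoted above is a single semicolon-delimited line, so it can be split into fields before grepping for related entries. A minimal sketch in Python, assuming six fields per record; the field names and the split_log_record helper are hypothetical labels picked for illustration, not names taken from the TORQUE documentation:

```python
def split_log_record(record):
    # Assumed field labels; only the ';' delimiter comes from the log line itself.
    names = ["timestamp", "event", "daemon", "obj_class", "obj_name", "message"]
    # maxsplit=5 keeps any ';' inside the free-text message in one piece.
    return dict(zip(names, record.split(";", 5)))


rec = split_log_record(
    "03/19/2012 10:11:06;0001;PBS_Server;Svr;PBS_Server;"
    "LOG_ERROR::process_pbs_server_port, Socket (10) close detected "
    "from 36.101.8.27:15004"
)
print(rec["timestamp"], rec["message"])
```

Splitting this way makes it easier to filter a whole server log on one field, for example the event code or the object class, while looking for entries around the same timestamp.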
-- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, March 19, 2012 10:01 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating On Mon, Mar 19, 2012 at 10:53 AM, DuChene, StevenX A > wrote: Ken: Thanks for the offer of a patch. Do the symptoms you see with this match what I reported? Like I indicated it seems the communication between maui & torque is not completely functional. According to my maui log files some communication is happening but not anything even close to enough for the whole system to work as intended. -- Steven DuChene Steven, I cannot say that they match. But when Maui communicates with TORQUE there is some name and host verification done. Ken From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, March 19, 2012 9:30 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating On Mon, Mar 19, 2012 at 9:51 AM, DuChene, StevenX A > wrote: RHEL6.1 with latest standard kernel. Steve, We have found a problem with CentOS6 where getaddrinfo returns localhost.localdomain instead of a hostname internally in TORQUE. We currently try to authorize connections using only localhost. I am making a fix for this to 4.0.1. I could send you a patch for the 4.0 code if you want. Of course I am guessing that is the problem, but you would know with the patch if it is. Ken _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/3cfb5b7d/attachment.html From stevenx.a.duchene at intel.com Mon Mar 19 11:35:17 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Mon, 19 Mar 2012 17:35:17 +0000 Subject: [torqueusers] maui and torque not communicating In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABD12F@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD0F9@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD12F@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABD141@ORSMSX106.amr.corp.intel.com> BTW, the next log entry right after this is: 03/19/2012 10:11:06;0002;PBS_Server;node;close_conn;Connection 10 - func 4403a0 Now I don't know if this is related to my non-communication issue or not. -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A Sent: Monday, March 19, 2012 10:25 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating Would there be any symptoms in either the torque or maui log files that I can look for that would match any of the issues that this patch would address? I don't recall seeing anything about host verification errors in either place. I am seeing this sort of error in the torque server log files: 03/19/2012 10:11:06;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::process_pbs_server_port, Socket (10) close detected from 36.101.8.27:15004 Is that a pointer that would reference this problem? 
-- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, March 19, 2012 10:01 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating On Mon, Mar 19, 2012 at 10:53 AM, DuChene, StevenX A > wrote: Ken: Thanks for the offer of a patch. Do the symptoms you see with this match what I reported? Like I indicated it seems the communication between maui & torque is not completely functional. According to my maui log files some communication is happening but not anything even close to enough for the whole system to work as intended. -- Steven DuChene Steven, I cannot say that they match. But when Maui communicates with TORQUE there is some name and host verification done. Ken From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, March 19, 2012 9:30 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating On Mon, Mar 19, 2012 at 9:51 AM, DuChene, StevenX A > wrote: RHEL6.1 with latest standard kernel. Steve, We have found a problem with CentOS6 where getaddrinfo returns localhost.localdomain instead of a hostname internally in TORQUE. We currently try to authorize connections using only localhost. I am making a fix for this to 4.0.1. I could send you a patch for the 4.0 code if you want. Of course I am guessing that is the problem, but you would know with the patch if it is. Ken _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/5b70a10f/attachment-0001.html

From nima.irt at gmail.com Mon Mar 19 12:59:02 2012
From: nima.irt at gmail.com (Nima Mohammadi)
Date: Mon, 19 Mar 2012 22:29:02 +0330
Subject: [torqueusers] Only a fraction of jobs are being run
In-Reply-To:
References:
Message-ID:

On Mon, Mar 19, 2012 at 8:26 PM, Nima Mohammadi wrote:
> Hi folks,
> It's new year's eve (Nowruz) in my country and apparently I've got the
> entire cluster to myself during holidays. So now that everyone is
> celebrating, I was going to submit my job array of hundreds of jobs to
> the queue. But unfortunately only a fraction of them run
> simultaneously, and others get into queue waiting for running jobs to
> get completed.
>
>
> Queue batch
>        queue_type = Execution
>        total_jobs = 213
>        state_count = Transit:0 Queued:202 Held:0 Waiting:0 Running:10 Exiting:0
>        resources_max.cput = 20000:00:00
>        resources_min.cput = 00:00:01
>        resources_default.cput = 10000:00:00
>        resources_default.nodes = 2
>        resources_default.walltime = 100:00:00
>        mtime = Mon Mar 19 18:53:51 2012
>        resources_assigned.mem = 0b
>        resources_assigned.nodect = 10
>        enabled = True
>        started = True
>
> -- Nima Mohammadi

I guess my problem has something to do with the value of resources_assigned.nodect. But even though I'm in the managers group, I don't seem to be able to change its value:

Qmgr: set queue batch resources_assigned.nodect = 20
qmgr obj=batch svr=default: Cannot set attribute, read only or insufficient permission resources_assigned.nodect

I also created another queue. I could change any parameter of the queue I could think of, except resources_assigned.nodect.
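
As a quick sanity check, the state_count value in the queue listing quoted above can be tallied with a few lines of Python. This is only a minimal sketch; parse_state_count is a hypothetical helper written for this note, not part of TORQUE:

```python
def parse_state_count(line):
    """Turn 'Transit:0 Queued:202 ...' into a dict of state name to job count."""
    counts = {}
    for field in line.split():
        name, _, value = field.partition(":")
        counts[name] = int(value)
    return counts


# state_count value copied from the 'Queue batch' listing in the quoted message
batch = parse_state_count(
    "Transit:0 Queued:202 Held:0 Waiting:0 Running:10 Exiting:0"
)
print("running: %d, queued: %d" % (batch["Running"], batch["Queued"]))
```

The Running count of 10 matches resources_assigned.nodect = 10 in the same listing. That would fit nodect being a report of nodes currently in use rather than a settable limit, though that reading is an assumption and not something the qmgr output confirms.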
-- Nima Mohammadi From jfarran at uci.edu Mon Mar 19 16:21:27 2012 From: jfarran at uci.edu (Joseph Farran) Date: Mon, 19 Mar 2012 15:21:27 -0700 Subject: [torqueusers] #PBS -V in version 2.5.10 Message-ID: <4F67B167.5020909@uci.edu> Hello. We were using Torque 2.5.9 and we were able to use the Torque PBS directive "#PBS -V" just fine. On upgrading to Torque 2.5.10, the same scripts which used to work using "#PBS -V" no longer work. When we submit a job using "#PBS -V", the job starts and nothing happens - no output, no errors, nothing. The job starts but nothing happens. Looking at Torque logs /opt/torque/server_logs shows no errors - just the job starting and ending. If we remove ""#PBS -V" then the job runs just fine. Anyone else ran into this or knows what is going on? Thanks, Joseph From knielson at adaptivecomputing.com Mon Mar 19 16:32:02 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 19 Mar 2012 16:32:02 -0600 Subject: [torqueusers] #PBS -V in version 2.5.10 In-Reply-To: <4F67B167.5020909@uci.edu> References: <4F67B167.5020909@uci.edu> Message-ID: On Mon, Mar 19, 2012 at 4:21 PM, Joseph Farran wrote: > Hello. > > We were using Torque 2.5.9 and we were able to use the Torque PBS > directive "#PBS -V" just fine. > > On upgrading to Torque 2.5.10, the same scripts which used to work using > "#PBS -V" no longer work. > > When we submit a job using "#PBS -V", the job starts and nothing happens - > no output, no errors, nothing. The job starts but nothing happens. > > Looking at Torque logs /opt/torque/server_logs shows no errors - just the > job starting and ending. > > If we remove ""#PBS -V" then the job runs just fine. > > Anyone else ran into this or knows what is going on? > > Thanks, > Joseph > > Did you have any array jobs in your queue when you upgraded? Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/2ce416bd/attachment.html From jfarran at uci.edu Mon Mar 19 16:42:44 2012 From: jfarran at uci.edu (Joseph Farran) Date: Mon, 19 Mar 2012 15:42:44 -0700 Subject: [torqueusers] #PBS -V in version 2.5.10 In-Reply-To: References: <4F67B167.5020909@uci.edu> Message-ID: <4F67B664.5040703@uci.edu> Hi Ken. Yes. One of our users, the person experiencing this problem, runs job arrays. I deleted all jobs prior to upgrading. Is there something I forgot to clean out? Joseph On 03/19/2012 03:32 PM, Ken Nielson wrote: > On Mon, Mar 19, 2012 at 4:21 PM, Joseph Farran > wrote: > > Hello. > > We were using Torque 2.5.9 and we were able to use the Torque PBS directive "#PBS -V" just fine. > > On upgrading to Torque 2.5.10, the same scripts which used to work using "#PBS -V" no longer work. > > When we submit a job using "#PBS -V", the job starts and nothing happens - no output, no errors, nothing. The job starts but nothing happens. > > Looking at Torque logs /opt/torque/server_logs shows no errors - just the job starting and ending. > > If we remove ""#PBS -V" then the job runs just fine. > > Anyone else ran into this or knows what is going on? > > Thanks, > Joseph > > Did you have any array jobs in your queue when you upgraded? 
> > Ken From stevenx.a.duchene at intel.com Mon Mar 19 16:46:34 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Mon, 19 Mar 2012 22:46:34 +0000 Subject: [torqueusers] maui and torque not communicating In-Reply-To: References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABD1FF@ORSMSX106.amr.corp.intel.com> Ken: BTW, in case I was not clear, I would like to try your hostname resolution patch for the 4.0 code to see if this resolves the issue with the maui - torque communication issue I am seeing. -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, March 19, 2012 9:30 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating On Mon, Mar 19, 2012 at 9:51 AM, DuChene, StevenX A > wrote: RHEL6.1 with latest standard kernel. Steve, We have found a problem with CentOS6 where getaddrinfo returns localhost.localdomain instead of a hostname internally in TORQUE. We currently try to authorize connections using only localhost. I am making a fix for this to 4.0.1. I could send you a patch for the 4.0 code if you want. Of course I am guessing that is the problem, but you would know with the patch if it is. Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120319/1eae4e8a/attachment.html From sm4082 at nyu.edu Mon Mar 19 17:37:14 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Mon, 19 Mar 2012 19:37:14 -0400 Subject: [torqueusers] #PBS -V in version 2.5.10 In-Reply-To: <4F67B664.5040703@uci.edu> References: <4F67B167.5020909@uci.edu> <4F67B664.5040703@uci.edu> Message-ID: <48ADB291-68A1-49F2-AE1D-97FC17109D0E@nyu.edu> We are also having this problem. 
Serious problem with this version is some pbs variables are not being defined (PBS_JOBNAME PBS_JOBID). This is the reason you don't see err and out files ( I am assuming user has these variables in pbs -e and -o directives). If you have compiled torque with --enable-syslog you can see in the logs on compute nodes that it can't create them since variables are undefined. I asked users to mention absolute path. For parallel jobs and array jobs I am sourcing a script file through wrapper. This script file defines pbs_nodefile that is needed for parallel jobs and array id for array jobs. Strangely, if I restart pbs_mom it works ok for the user who had failed jobs before. But after a while it happens all again for different user. I checked 2.5.11 and there are not that many differences between this and 2.5.10. Not sure upgrading to 11 would solve this problem. Sreedhar. -- Sent from my phone. Please excuse my brevity and any typos. On Mar 19, 2012, at 18:42, Joseph Farran wrote: > Hi Ken. > > Yes. One of our users has job arrays which is the person experiencing this problem. I deleted all jobs prior to upgrading. > > Is there something I forgot go clean out that needs cleaning? > > Joseph > > > On 03/19/2012 03:32 PM, Ken Nielson wrote: >> On Mon, Mar 19, 2012 at 4:21 PM, Joseph Farran > wrote: >> >> Hello. >> >> We were using Torque 2.5.9 and we were able to use the Torque PBS directive "#PBS -V" just fine. >> >> On upgrading to Torque 2.5.10, the same scripts which used to work using "#PBS -V" no longer work. >> >> When we submit a job using "#PBS -V", the job starts and nothing happens - no output, no errors, nothing. The job starts but nothing happens. >> >> Looking at Torque logs /opt/torque/server_logs shows no errors - just the job starting and ending. >> >> If we remove ""#PBS -V" then the job runs just fine. >> >> Anyone else ran into this or knows what is going on? >> >> Thanks, >> Joseph >> >> Did you have any array jobs in your queue when you upgraded? 
>> >> Ken > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From samuel at unimelb.edu.au Mon Mar 19 17:43:38 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 20 Mar 2012 10:43:38 +1100 Subject: [torqueusers] Only a fraction of jobs are being run In-Reply-To: References: Message-ID: <4F67C4AA.2080904@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Nima, On 20/03/12 03:56, Nima Mohammadi wrote: > But unfortunately only a fraction of them run simultaneously, and > others get into queue waiting for running jobs to get completed. What scheduler are you using there? It'll be one of pbs_sched, maui or moab. cheers! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk9nxKoACgkQO2KABBYQAh9tEgCdHQvySPyBb4HYFRmcohU1K/Wq sR8An1juT7EMA8Y3KKL4SaArIXeAWZ1t =TaRI -----END PGP SIGNATURE----- From cwest at vpac.org Mon Mar 19 20:48:02 2012 From: cwest at vpac.org (Craig West) Date: Tue, 20 Mar 2012 13:48:02 +1100 Subject: [torqueusers] TORQUE 4.0 Officially Announced In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> Message-ID: <4F67EFE2.1050602@vpac.org> Steven, > BTW, I built torque on my server without hwloc installed and then > installed the resulting mom packages on my nodes. The mom daemons in > that case did seem to start up just fine. HWLOC is only needed if you are using cpusets. 
If you don't use cpuset then you won't need HWLOC on your nodes either. Cheers, Craig. From nima.irt at gmail.com Mon Mar 19 23:51:40 2012 From: nima.irt at gmail.com (Nima Mohammadi) Date: Tue, 20 Mar 2012 09:21:40 +0330 Subject: [torqueusers] Only a fraction of jobs are being run In-Reply-To: <4F67C4AA.2080904@unimelb.edu.au> References: <4F67C4AA.2080904@unimelb.edu.au> Message-ID: We use Maui. [mohammadi at server ~]$ service pbs_sched status pbs_sched is stopped [mohammadi at server ~]$ service maui.d status /etc/init.d/maui.d: line 8: ulimit: open files: cannot modify limit: Operation not permitted maui (pid 2394) is running... Moreover, when I define multiple queues and submit jobs to them, the overall number of running jobs stays the same. -- Nima Mohammadi On Mar 20, 2012, at 3:13 AM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi Nima, > > On 20/03/12 03:56, Nima Mohammadi wrote: > >> But unfortunately only a fraction of them run simultaneously, and >> others get into queue waiting for running jobs to get completed. > > What scheduler are you using there? > > It'll be one of pbs_sched, maui or moab. > > cheers! 
> Chris > - -- > Christopher Samuel - Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.unimelb.edu.au/ > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk9nxKoACgkQO2KABBYQAh9tEgCdHQvySPyBb4HYFRmcohU1K/Wq > sR8An1juT7EMA8Y3KKL4SaArIXeAWZ1t > =TaRI > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From ngsbioinformatics at gmail.com Tue Mar 20 09:50:19 2012 From: ngsbioinformatics at gmail.com (Ryan Golhar) Date: Tue, 20 Mar 2012 11:50:19 -0400 Subject: [torqueusers] interactive jobs abruptly end Message-ID: Hi all - I noticed that whenever I run an interactive job using 'qsub -I', sometimes I can be in the middle of typing something and the job abruptly ends. Googling didn't turn anything up about this. Non-interactive jobs run fine. I've never seen this before. Ideas? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120320/1fc8d637/attachment.html From sm4082 at nyu.edu Tue Mar 20 10:12:38 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Tue, 20 Mar 2012 12:12:38 -0400 Subject: [torqueusers] interactive jobs abruptly end In-Reply-To: References: Message-ID: <433B1A98-D106-445E-AC40-35E1E0C99260@nyu.edu> Hi, Which version of torque are you running? We had 2.5.8 and it had the same problem. This issue was fixed in 2.5.10 I think. But we're facing some other issues with 2.5.10. Sreedhar. On Mar 20, 2012, at 11:50 AM, Ryan Golhar wrote: > Hi all - I noticed that whenever I run an interactive job using 'qsub -I', sometimes I can be in the middle of typing something and the job abruptly ends. 
Googling didn't turn anything up about this. Non-interactive jobs run fine. I've never seen this before. Ideas? _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From ngsbioinformatics at gmail.com Tue Mar 20 10:14:28 2012 From: ngsbioinformatics at gmail.com (Ryan Golhar) Date: Tue, 20 Mar 2012 12:14:28 -0400 Subject: [torqueusers] interactive jobs abruptly end In-Reply-To: <433B1A98-D106-445E-AC40-35E1E0C99260@nyu.edu> References: <433B1A98-D106-445E-AC40-35E1E0C99260@nyu.edu> Message-ID: I'm using 2.5.9. Hmmm, guess I have to upgrade. Is there a torque roll for 2.5.10 of higher already available? On Tue, Mar 20, 2012 at 12:12 PM, Sreedhar Manchu wrote: > Hi, > > Which version of torque are you running? We had 2.5.8 and it had the same > problem. This issue was fixed in 2.5.10 I think. But we're facing some > other issues with 2.5.10. > > Sreedhar. > > On Mar 20, 2012, at 11:50 AM, Ryan Golhar wrote: > > > Hi all - I noticed that whenever I run an interactive job using 'qsub > -I', sometimes I can be in the middle of typing something and the job > abruptly ends. Googling didn't turn anything up about this. > Non-interactive jobs run fine. I've never seen this before. Ideas? > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120320/799027fe/attachment.html From ngsbioinformatics at gmail.com Tue Mar 20 10:15:17 2012 From: ngsbioinformatics at gmail.com (Ryan Golhar) Date: Tue, 20 Mar 2012 12:15:17 -0400 Subject: [torqueusers] interactive jobs abruptly end In-Reply-To: References: <433B1A98-D106-445E-AC40-35E1E0C99260@nyu.edu> Message-ID: Nevermind. I see 5.2.0 is available. On Tue, Mar 20, 2012 at 12:14 PM, Ryan Golhar wrote: > I'm using 2.5.9. Hmmm, guess I have to upgrade. Is there a torque roll > for 2.5.10 of higher already available? > > > On Tue, Mar 20, 2012 at 12:12 PM, Sreedhar Manchu wrote: > >> Hi, >> >> Which version of torque are you running? We had 2.5.8 and it had the same >> problem. This issue was fixed in 2.5.10 I think. But we're facing some >> other issues with 2.5.10. >> >> Sreedhar. >> >> On Mar 20, 2012, at 11:50 AM, Ryan Golhar wrote: >> >> > Hi all - I noticed that whenever I run an interactive job using 'qsub >> -I', sometimes I can be in the middle of typing something and the job >> abruptly ends. Googling didn't turn anything up about this. >> Non-interactive jobs run fine. I've never seen this before. Ideas? >> _______________________________________________ >> > torqueusers mailing list >> > torqueusers at supercluster.org >> > http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120320/0dab428e/attachment-0001.html From stephen.b.weston at gmail.com Tue Mar 20 10:02:46 2012 From: stephen.b.weston at gmail.com (Stephen Weston) Date: Tue, 20 Mar 2012 12:02:46 -0400 Subject: [torqueusers] interactive jobs abruptly end In-Reply-To: References: Message-ID: This sounds like a bug that we noticed in Torque 2.5.8. It was fixed in Torque 2.5.10, I believe. - Steve On Tue, Mar 20, 2012 at 11:50 AM, Ryan Golhar wrote: > Hi all - I noticed that whenever I run an interactive job using 'qsub -I', > sometimes I can be in the middle of typing something and the job abruptly > ends. Googling didn't turn anything up about this. Non-interactive jobs > run fine. I've never seen this before. Ideas? > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From samuel at unimelb.edu.au Tue Mar 20 17:36:48 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 21 Mar 2012 10:36:48 +1100 Subject: [torqueusers] Only a fraction of jobs are being run In-Reply-To: References: <4F67C4AA.2080904@unimelb.edu.au> Message-ID: <4F691490.1030302@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 20/03/12 16:51, Nima Mohammadi wrote: > Moreover, when I define multiple queues and submit jobs to them, > the overall number of running jobs stays the same. The issues you'll be seeing will be because of policies in Maui then, not in Torque. Probably something like: USERCFG[DEFAULT] MAXPROC=512 USERCFG[DEFAULT] MAXJOB=512 USERCFG[DEFAULT] MAXIJOB=5 That's taken from our Moab config, but I'm pretty sure it'll be the same for Maui too. cheers! 
Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk9pFI8ACgkQO2KABBYQAh/WLwCdEtVmp9V3iiIbuR3b1qEydHJv B1oAn17KOFJ+9fh2RfKVi/z6OMRI44C8 =laCa -----END PGP SIGNATURE----- From stevenx.a.duchene at intel.com Tue Mar 20 17:44:52 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Tue, 20 Mar 2012 23:44:52 +0000 Subject: [torqueusers] maui and torque not communicating In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABD1FF@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD1FF@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABD497@ORSMSX106.amr.corp.intel.com> Ken: You might have to send the patch to a different E-mail address as my Intel address might filter out a patch depending on the filename extension. I have not seen anything from you yet so I am wondering if it might have gotten "lost" in the "tubes" :) -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A Sent: Monday, March 19, 2012 3:47 PM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating Ken: BTW, in case I was not clear, I would like to try your hostname resolution patch for the 4.0 code to see if this resolves the issue with the maui - torque communication issue I am seeing. 
-- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, March 19, 2012 9:30 AM To: Torque Users Mailing List Subject: Re: [torqueusers] maui and torque not communicating On Mon, Mar 19, 2012 at 9:51 AM, DuChene, StevenX A > wrote: RHEL6.1 with latest standard kernel. Steve, We have found a problem with CentOS6 where getaddrinfo returns localhost.localdomain instead of a hostname internally in TORQUE. We currently try to authorize connections using only localhost. I am making a fix for this to 4.0.1. I could send you a patch for the 4.0 code if you want. Of course I am guessing that is the problem, but you would know with the patch if it is. Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120320/51a4a7be/attachment.html From samuel at unimelb.edu.au Tue Mar 20 19:54:43 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 21 Mar 2012 12:54:43 +1100 Subject: [torqueusers] maui and torque not communicating In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABD497@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD1FF@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABD497@ORSMSX106.amr.corp.intel.com> Message-ID: <4F6934E3.2090501@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 21/03/12 10:44, DuChene, StevenX A wrote: > You might have to send the patch to a different E-mail address as > my Intel address might filter out a patch depending on the > filename extension. I have not seen anything from you yet so I am > wondering if it might have gotten "lost" in the "tubes" 
I did take a quick look in my git svn clone, but I don't see anything obvious there I'm afraid.. :-( - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk9pNOMACgkQO2KABBYQAh8KVgCdFKha20g5Shehsjt7iwyL4Q5Z uogAmwd+mAkk1pm6+0ZG6Vr0I0vpaRi/ =nR87 -----END PGP SIGNATURE----- From chrisbee at uw.edu Tue Mar 20 15:17:51 2012 From: chrisbee at uw.edu (Chris Berthiaume) Date: Tue, 20 Mar 2012 14:17:51 -0700 Subject: [torqueusers] kill_delay in torque 2.5.9 Message-ID: Hello, I'm trying to get an extended kill_delay working with torque 2.5.9, but so far I haven't been able to exceed a 5 second delay between SIGTERM and SIGKILL. After reading various mailing list entries it looks like this issue has been encountered in the past and with 2.5.9 it should be possible to set a longer kill_delay. Here's how I've configured pbs_server and pbs_mom. $ qmgr -c 'print queue gross' # # Create queues and set their attributes. # # # Create and define queue gross # create queue gross set queue gross queue_type = Execution set queue gross resources_default.neednodes = gross set queue gross kill_delay = 30 set queue gross enabled = True set queue gross started = True $ cat /opt/torque/mom_priv/config $ignwalltime false $kill_delay true To test these settings I run a submit script that traps SIGTERM and in that trap prints the date every second. Then I issue a qdel for this job. Only 5 seconds worth of date output from the SIGTERM trap function appears. Is there anything more I need to do to enable kill_delay? I gather it's pbs_mom which is subverting the server kill_delay and sending SIGKILL to the job after 5 seconds, but the undocumented mom config option "$kill_delay true" should override this. 
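The test described above (trap SIGTERM, then count how many seconds of output appear before SIGKILL) can also be tried outside Torque first, to make sure the script itself reacts to the signal as expected. This is only a sketch under some assumptions: term_logger is a made-up helper name, and the optional argument for the sleep length is added purely so the check finishes quickly. One detail that matters: the shell runs a trap handler only after the current foreground command returns, so the sketch puts the long sleep in the background and blocks in wait, which a trapped signal interrupts immediately.

```shell
#!/bin/sh
# Sketch: print a timestamp once per second after SIGTERM is received,
# until the process is finally killed. Not part of Torque.
term_logger() {
    # On SIGTERM, keep reporting until SIGKILL ends the process.
    trap 'while true; do echo "alive $(date +%T)"; sleep 1; done' TERM
    # Background the long sleep and wait on it: a trapped signal makes
    # `wait` return at once, after which the handler runs.
    sleep "${1:-600}" &
    wait $!
}

# Local dry run without a batch system (3 seconds between TERM and KILL):
#   term_logger 10 > /tmp/term_logger.out 2>&1 &
#   pid=$!; sleep 1; kill -TERM "$pid"; sleep 3; kill -KILL "$pid"
#   cat /tmp/term_logger.out
```

Inside a real job the number of "alive" lines then gives a direct reading of the delay the mom actually allowed between the two signals.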
Here's my submit script. #!/bin/bash function termtrap() { while true; do date sleep 1 done } trap termtrap SIGTERM sleep 600 Thanks, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120320/a2b1ac34/attachment-0001.html From shantanugadgil at yahoo.com Tue Mar 20 22:32:01 2012 From: shantanugadgil at yahoo.com (Shantanu Gadgil) Date: Tue, 20 Mar 2012 21:32:01 -0700 (PDT) Subject: [torqueusers] TORQUE 4.0.0 does not allow submitting jobs as root Message-ID: <1332304321.21675.YahooMailClassic@web120605.mail.ne1.yahoo.com> Hi, I tried to allow job submission by root by executing: $> qmgr -c 's s acl_roots += root@*' ... but it doesn't seem to work! Any clue as to what else I might be missing? The same steps work for torque 3.0.4 --- snip --- [root at server ~]# cat /etc/centos-release CentOS release 6.2 (Final) [root at server ~]# uname -a Linux server 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux [root at server ~]# ls -l torque-4.0.0.tar.gz -rwx------. 1 root root 6103619 Mar 21 01:11 torque-4.0.0.tar.gz [root at server ~]# rpmbuild --with gui -tb torque-4.0.0.tar.gz [root at server torque-4.0.0]# rpm -Uvh /root/rpmbuild/RPMS/x86_64/torque-*.x86_64.rpm [root at server torque-4.0.0]# tar xvf torque-4.0.0.tar.gz [root at server torque-4.0.0]# ./torque.setup root initializing TORQUE (admin: root at server) You have selected to start pbs_server in create mode. If the server database exists it will be overwritten. do you wish to continue y/(n)?y Max open servers: 9 set server operators += root at server Max open servers: 9 set server managers += root at server [root at server torque-4.0.0]# qterm [root at server torque-4.0.0]# /etc/init.d/pbs_server start [root at server torque-4.0.0]# qmgr -c 'p s' # # Create queues and set their attributes. 
# # # Create and define queue batch # create queue batch set queue batch queue_type = Execution set queue batch resources_default.nodes = 1 set queue batch resources_default.walltime = 01:00:00 set queue batch enabled = True set queue batch started = True # # Set server attributes. # set server scheduling = True set server acl_hosts = server set server acl_roots = root at server set server acl_roots += root@* set server managers = root at server set server operators = root at server set server default_queue = batch set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 300 set server job_stat_rate = 45 set server poll_jobs = True set server mom_job_sync = True set server keep_completed = 300 set server next_job_number = 0 set server moab_array_compatible = True [root at server ~]# echo "sleep 100" | qsub qsub can not be run as root --- snip --- From knielson at adaptivecomputing.com Wed Mar 21 11:06:47 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 21 Mar 2012 11:06:47 -0600 Subject: [torqueusers] TORQUE 4.0.0 does not allow submitting jobs as root In-Reply-To: <1332304321.21675.YahooMailClassic@web120605.mail.ne1.yahoo.com> References: <1332304321.21675.YahooMailClassic@web120605.mail.ne1.yahoo.com> Message-ID: We know about this problem on CentOS. It has to do with getaddrinfo(3) and how it resolves localhost. I will let you know when we have it fixed. Ken On Tue, Mar 20, 2012 at 10:32 PM, Shantanu Gadgil wrote: > Hi, > > I tried to allow job submission by root by executing: > > $> qmgr -c 's s acl_roots += root@*' > > ... but it doesn't seem to work! Any clue as to what else I might be > missing? 
The same steps work for torque 3.0.4 > > --- snip --- > [root at server ~]# cat /etc/centos-release > CentOS release 6.2 (Final) > > [root at server ~]# uname -a > Linux server 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 > x86_64 x86_64 x86_64 GNU/Linux > > [root at server ~]# ls -l torque-4.0.0.tar.gz > -rwx------. 1 root root 6103619 Mar 21 01:11 torque-4.0.0.tar.gz > > [root at server ~]# rpmbuild --with gui -tb torque-4.0.0.tar.gz > > [root at server torque-4.0.0]# rpm -Uvh > /root/rpmbuild/RPMS/x86_64/torque-*.x86_64.rpm > > [root at server torque-4.0.0]# tar xvf torque-4.0.0.tar.gz > > [root at server torque-4.0.0]# ./torque.setup root > initializing TORQUE (admin: root at server) > > You have selected to start pbs_server in create mode. > If the server database exists it will be overwritten. > do you wish to continue y/(n)?y > Max open servers: 9 > set server operators += root at server > Max open servers: 9 > set server managers += root at server > > [root at server torque-4.0.0]# qterm > [root at server torque-4.0.0]# /etc/init.d/pbs_server start > > [root at server torque-4.0.0]# qmgr -c 'p s' > # > # Create queues and set their attributes. > # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = server > set server acl_roots = root at server > set server acl_roots += root@* > set server managers = root at server > set server operators = root at server > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 300 > set server job_stat_rate = 45 > set server poll_jobs = True > set server mom_job_sync = True > set server keep_completed = 300 > set server next_job_number = 0 > set server moab_array_compatible = True > > [root at server ~]# echo "sleep 100" | qsub > qsub can not be run as root > > --- snip --- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120321/539085ad/attachment.html From knielson at adaptivecomputing.com Wed Mar 21 11:30:33 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 21 Mar 2012 11:30:33 -0600 Subject: [torqueusers] Requeue job if node fails In-Reply-To: <1331738654.80908.YahooMailNeo@web114713.mail.gq1.yahoo.com> References: <1331738654.80908.YahooMailNeo@web114713.mail.gq1.yahoo.com> Message-ID: Yes. Restart pbs_mom with a -q option and it will requeue all jobs that were running when the node failed. Ken On Wed, Mar 14, 2012 at 9:24 AM, Calin Ilis wrote: > Hi, > > Is it possible using torque/maui to requeu a job that was executing on a > node which failed. My jobs are single node jobs. So the failed node is the > mother superior node. 
> > Thanks > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120321/3e519d05/attachment.html From avb at ssau.ru Thu Mar 22 02:54:18 2012 From: avb at ssau.ru (Alexandr Baskakov) Date: Thu, 22 Mar 2012 12:54:18 +0400 Subject: [torqueusers] LOG_ERROR::svr_recov_xml, No server tag found in the database file??? Message-ID: <4F6AE8BA.4010505@ssau.ru> Hi all. I am just trying out Torque 4.0.0 and get this error after restarting pbs_server. # service pbs_server start Starting TORQUE Server: PBS_Server: LOG_ERROR::svr_recov_xml, No server tag found in the database file??? PBS_Server: LOG_ERROR::recov_svr_attr, Unable to read server database /usr/sbin/pbs_server: failed to get server attributes [FAILED] After a clean install and the first start (which creates serverdb) everything works OK. The error only appears after stopping the server and trying to start it again. OS: RHEL 6.2 -- Alexandr Baskakov, Samara State Aerospace University e-mail: avb at ssau.ru From jfarran at uci.edu Thu Mar 22 16:06:19 2012 From: jfarran at uci.edu (Joseph Farran) Date: Thu, 22 Mar 2012 15:06:19 -0700 Subject: [torqueusers] #PBS -V in version 2.5.10 In-Reply-To: <48ADB291-68A1-49F2-AE1D-97FC17109D0E@nyu.edu> References: <4F67B167.5020909@uci.edu> <4F67B664.5040703@uci.edu> <48ADB291-68A1-49F2-AE1D-97FC17109D0E@nyu.edu> Message-ID: <4F6BA25B.4030803@uci.edu> Sreedhar, are you using Rocks 5.4.3 by any chance? The "#PBS -V" was *NOT* an issue with Torque after all but rather a Rocks BUG. We are using Rocks 5.4.3 and after applying this fix: http://groups.google.com/group/rocks-clusters/browse_thread/thread/d56541d7755438c7/8813bf8ed30a66d1?fwc=2&pli=1 The "#PBS -V" works just fine and as expected. 
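For the other symptom reported in this thread (PBS_JOBID, PBS_JOBNAME and similar variables sometimes not being defined inside the job), a small guard at the top of the job script can at least make the failure visible instead of silent. This is a sketch only: check_pbs_env is a hypothetical helper, not a Torque command, and the list of variables checked is an assumption to adapt to what the job really uses.

```shell
#!/bin/sh
# Sketch: report which expected PBS_* variables are empty or unset.
check_pbs_env() {
    missing=""
    for name in PBS_JOBID PBS_JOBNAME PBS_NODEFILE PBS_O_WORKDIR; do
        # Indirect lookup via eval; names come only from the fixed list above.
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "missing PBS variables:$missing" >&2
        return 1
    fi
    return 0
}

# Usage at the top of a job script:
#   check_pbs_env || exit 1
```

Because the message goes to stderr, it only helps if the error file path itself does not depend on the missing variables, so an absolute path in the -e directive is the safer choice while debugging.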
Hope this helps, Joseph On 03/19/2012 04:37 PM, Sreedhar Manchu wrote: > We are also having this problem. Serious problem with this version is some pbs variables are not being defined (PBS_JOBNAME PBS_JOBID). This is the reason you don't see err and out files ( I am assuming user has these variables in pbs -e and -o directives). If you have compiled torque with --enable-syslog you can see in the logs on compute nodes that it can't create them since variables are undefined. > > I asked users to mention absolute path. For parallel jobs and array jobs I am sourcing a script file through wrapper. This script file defines pbs_nodefile that is needed for parallel jobs and array id for array jobs. > > Strangely, if I restart pbs_mom it works ok for the user who had failed jobs before. But after a while it happens all again for different user. I checked 2.5.11 and there are not that many differences between this and 2.5.10. Not sure upgrading to 11 would solve this problem. > > Sreedhar. > > -- > Sent from my phone. Please excuse my brevity and any typos. > > On Mar 19, 2012, at 18:42, Joseph Farran wrote: > >> Hi Ken. >> >> Yes. One of our users has job arrays which is the person experiencing this problem. I deleted all jobs prior to upgrading. >> >> Is there something I forgot go clean out that needs cleaning? >> >> Joseph >> >> >> On 03/19/2012 03:32 PM, Ken Nielson wrote: >>> On Mon, Mar 19, 2012 at 4:21 PM, Joseph Farran> wrote: >>> >>> Hello. >>> >>> We were using Torque 2.5.9 and we were able to use the Torque PBS directive "#PBS -V" just fine. >>> >>> On upgrading to Torque 2.5.10, the same scripts which used to work using "#PBS -V" no longer work. >>> >>> When we submit a job using "#PBS -V", the job starts and nothing happens - no output, no errors, nothing. The job starts but nothing happens. >>> >>> Looking at Torque logs /opt/torque/server_logs shows no errors - just the job starting and ending. >>> >>> If we remove ""#PBS -V" then the job runs just fine. 
>>> >>> Anyone else ran into this or knows what is going on? >>> >>> Thanks, >>> Joseph >>> >>> Did you have any array jobs in your queue when you upgraded? >>> >>> Ken >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > From sm4082 at nyu.edu Thu Mar 22 18:08:07 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Thu, 22 Mar 2012 20:08:07 -0400 Subject: [torqueusers] #PBS -V in version 2.5.10 In-Reply-To: <4F6BA25B.4030803@uci.edu> References: <4F67B167.5020909@uci.edu> <4F67B664.5040703@uci.edu> <48ADB291-68A1-49F2-AE1D-97FC17109D0E@nyu.edu> <4F6BA25B.4030803@uci.edu> Message-ID: <0EE53879-41A4-4324-BEB9-936194A05335@nyu.edu> Thanks, Joseph. But we are using rocks 5.1. In fact, we don't have any problem with -V flag. It is just that sometimes pbs variables are not being defined by torque. It is very random the way it's been happening. PBS_NODEFILE, PBS_JOBID and PBS_JOBNAME and few more variables are missing. I am looking into code to try to understand what is happening. Thanks Sreedhar. -- Sent from my phone. Please excuse my brevity and any typos. On Mar 22, 2012, at 18:06, Joseph Farran wrote: > Sreedhar, are you using Rocks 5.4.3 by any change? > > The "#PBS -V" was *NOT* an issue with Torque after all but rather a Rocks BUG. > > We are using Rocks 5.4.3 and after applying this fix: > > http://groups.google.com/group/rocks-clusters/browse_thread/thread/d56541d7755438c7/8813bf8ed30a66d1?fwc=2&pli=1 > > The "#PBS -V" works just fine and as expected. > > Hope this helps, > Joseph > > > On 03/19/2012 04:37 PM, Sreedhar Manchu wrote: >> We are also having this problem. Serious problem with this version is some pbs variables are not being defined (PBS_JOBNAME PBS_JOBID). This is the reason you don't see err and out files ( I am assuming user has these variables in pbs -e and -o directives). 
If you have compiled torque with --enable-syslog you can see in the logs on compute nodes that it can't create them since variables are undefined. >> >> I asked users to mention absolute path. For parallel jobs and array jobs I am sourcing a script file through wrapper. This script file defines pbs_nodefile that is needed for parallel jobs and array id for array jobs. >> >> Strangely, if I restart pbs_mom it works ok for the user who had failed jobs before. But after a while it happens all again for different user. I checked 2.5.11 and there are not that many differences between this and 2.5.10. Not sure upgrading to 11 would solve this problem. >> >> Sreedhar. >> >> -- >> Sent from my phone. Please excuse my brevity and any typos. >> >> On Mar 19, 2012, at 18:42, Joseph Farran wrote: >> >>> Hi Ken. >>> >>> Yes. One of our users has job arrays which is the person experiencing this problem. I deleted all jobs prior to upgrading. >>> >>> Is there something I forgot go clean out that needs cleaning? >>> >>> Joseph >>> >>> >>> On 03/19/2012 03:32 PM, Ken Nielson wrote: >>>> On Mon, Mar 19, 2012 at 4:21 PM, Joseph Farran> wrote: >>>> >>>> Hello. >>>> >>>> We were using Torque 2.5.9 and we were able to use the Torque PBS directive "#PBS -V" just fine. >>>> >>>> On upgrading to Torque 2.5.10, the same scripts which used to work using "#PBS -V" no longer work. >>>> >>>> When we submit a job using "#PBS -V", the job starts and nothing happens - no output, no errors, nothing. The job starts but nothing happens. >>>> >>>> Looking at Torque logs /opt/torque/server_logs shows no errors - just the job starting and ending. >>>> >>>> If we remove ""#PBS -V" then the job runs just fine. >>>> >>>> Anyone else ran into this or knows what is going on? >>>> >>>> Thanks, >>>> Joseph >>>> >>>> Did you have any array jobs in your queue when you upgraded? 
>>>> >>>> Ken >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> From samuel at unimelb.edu.au Thu Mar 22 21:00:11 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 23 Mar 2012 14:00:11 +1100 Subject: [torqueusers] Torque 4.0.0 and cpusets In-Reply-To: <229313363.37076179.1331852061413.JavaMail.root@zm09.stanford.edu> References: <229313363.37076179.1331852061413.JavaMail.root@zm09.stanford.edu> Message-ID: <4F6BE73B.6000904@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 16/03/12 09:54, David Gabriel Simas wrote: > Indeed, /dev/cpuset/cpus doesn't exist on my system. It seems to > be named /dev/cpuset/cpuset.cpus instead. Likewise, > /dev/cpuset/mems seems to be /dev/cpuset/cpuset.mems. That's the > case with two kernels I've tried, 2.6.38.6-26 and 3.2.9-2. It also > seems inconsistent with the documentation in cpuset(7). That looks like you've got the modern cgroup filesystem mounted on /dev/cpuset, not the older cpuset filesystem. I was under the impression that Torque 4 was going to use hwloc for this, so I'm surprised it's still trying the old school way of doing it. 
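One way to check this diagnosis is to look at how /dev/cpuset appears in the mount table. The following is a minimal sketch, not something posted in the thread: `mount_entry` is a hypothetical helper, and it assumes the usual /proc/mounts field order (device, mount point, filesystem type, options). A filesystem type of `cgroup` without `noprefix` among the options would be consistent with the `cpuset.`-prefixed file names reported above.

```shell
#!/bin/bash
# Sketch only: show how a mount point appears in a /proc/mounts-style table.
# mount_entry is a hypothetical helper, not part of Torque.
mount_entry() {
    local mountpoint="$1"
    local mounts_file="${2:-/proc/mounts}"
    # Nothing to report if the mounts table cannot be read.
    [ -r "$mounts_file" ] || return 0
    # Fields: device, mount point, fstype, options, dump, pass.
    # Print "fstype options" for the last entry matching the mount point.
    awk -v mp="$mountpoint" \
        '$2 == mp { entry = $3 " " $4 } END { if (entry != "") print entry }' \
        "$mounts_file"
}

entry="$(mount_entry /dev/cpuset)"
if [ -z "$entry" ]; then
    echo "nothing mounted on /dev/cpuset"
else
    echo "/dev/cpuset: $entry"
fi
```

The second argument exists only so the helper can be pointed at a saved copy of a mounts table from another node.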
cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk9r5zsACgkQO2KABBYQAh88yQCfQ3bixaY2CDDIJOjLz2nbRC8O 0VEAnR/Cwm3cXAPzIX/Nng7m/Nf80LVZ =qidg -----END PGP SIGNATURE----- From knielson at adaptivecomputing.com Fri Mar 23 08:33:40 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 23 Mar 2012 08:33:40 -0600 Subject: [torqueusers] Torque 4.0.0 and cpusets In-Reply-To: <4F6BE73B.6000904@unimelb.edu.au> References: <229313363.37076179.1331852061413.JavaMail.root@zm09.stanford.edu> <4F6BE73B.6000904@unimelb.edu.au> Message-ID: On Thu, Mar 22, 2012 at 9:00 PM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 16/03/12 09:54, David Gabriel Simas wrote: > > > Indeed, /dev/cpuset/cpus doesn't exist on my system. It seems to > > be named /dev/cpuset/cpuset.cpus instead. Likewise, > > /dev/cpuset/mems seems to be /dev/cpuset/cpuset.mems. That's the > > case with two kernels I've tried, 2.6.38.6-26 and 3.2.9-2. It also > > seems inconsistent with the documentation in cpuset(7). > > That looks like you've got the modern cgroup filesystem mounted on > /dev/cpuset, not the older cpuset filesystem. > > I was under the impression that Torque 4 was going to use hwloc for > this, so I'm surprised it's still trying the old school way of doing it. > > cheers, > Chris > - -- > Christopher Samuel - Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.unimelb.edu.au/ > > TORQUE is using hwloc. We will need to see if we can reproduce this. Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120323/08dad5da/attachment.html

From bonafernando at gmail.com Thu Mar 22 07:46:49 2012
From: bonafernando at gmail.com (Fernando W. Bona)
Date: Thu, 22 Mar 2012 10:46:49 -0300
Subject: [torqueusers] How to access torque via PHP on Linux
In-Reply-To:
References:
Message-ID:

Hi.

We are developing a website and trying to access the Torque with it. The website is being coded with PHP. We have a virtual machine running ubuntu and in it torque and PHP is running. Our objective is to send the same bash commands we do when using Torque by ourselves.

Can anyone help?

Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120322/758d3cb1/attachment.html

From nima.irt at gmail.com Fri Mar 23 09:37:04 2012
From: nima.irt at gmail.com (Nima Mohammadi)
Date: Fri, 23 Mar 2012 20:07:04 +0430
Subject: [torqueusers] Only a fraction of jobs are being run
In-Reply-To: <4F691490.1030302@unimelb.edu.au>
References: <4F67C4AA.2080904@unimelb.edu.au> <4F691490.1030302@unimelb.edu.au>
Message-ID:

On Wed, Mar 21, 2012 at 4:06 AM, Christopher Samuel wrote:
>
> The issues you'll be seeing will be because of policies in Maui then,
> not in Torque.
>
> Probably something like:
>
> USERCFG[DEFAULT]        MAXPROC=512
> USERCFG[DEFAULT]        MAXJOB=512
> USERCFG[DEFAULT]        MAXIJOB=5
>
> That's taken from our Moab config, but I'm pretty sure it'll be the
> same for Maui too.
>

Well, then I guess I need to wait until after the sysadmin comes back from vacation.
Thanks anyway :) [mohammadi at server ~]$ /usr/local/maui/bin/diagnose -Q ERROR: 'diagnose' failed ERROR: user 'mohammadi' is not authorized to execute command 'diagnose' -- Nima Mohammadi From dsimas at stanford.edu Fri Mar 23 14:07:51 2012 From: dsimas at stanford.edu (David Gabriel Simas) Date: Fri, 23 Mar 2012 13:07:51 -0700 (PDT) Subject: [torqueusers] Torque 4.0.0 and cpusets In-Reply-To: Message-ID: <498116318.49327253.1332533271605.JavaMail.root@zm09.stanford.edu> To make pbs_mom happy, I have to do something like this: umount /dev/cpuset umount /sys/fs/cgroup/cpuset mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset DGS ----- Original Message ----- > > > > > On Thu, Mar 22, 2012 at 9:00 PM, Christopher Samuel < > samuel at unimelb.edu.au > wrote: > > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > > On 16/03/12 09:54, David Gabriel Simas wrote: > > > Indeed, /dev/cpuset/cpus doesn't exist on my system. It seems to > > be named /dev/cpuset/cpuset.cpus instead. Likewise, > > /dev/cpuset/mems seems to be /dev/cpuset/cpuset.mems. That's the > > case with two kernels I've tried, 2.6.38.6-26 and 3.2.9-2. It also > > seems inconsistent with the documentation in cpuset(7). > > That looks like you've got the modern cgroup filesystem mounted on > /dev/cpuset, not the older cpuset filesystem. > > I was under the impression that Torque 4 was going to use hwloc for > this, so I'm surprised it's still trying the old school way of doing > it. > > cheers, > Chris > - -- > Christopher Samuel - Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.unimelb.edu.au/ > > TORQUE is using hwloc. We will need to see if we can reproduce this. 
>
> Ken
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From sm4082 at nyu.edu Fri Mar 23 15:22:33 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 23 Mar 2012 17:22:33 -0400
Subject: [torqueusers] Disabling or restrict in time the interactive queue
In-Reply-To:
References: <13B5873C-C96F-45F0-A9B4-6678566D1BEF@nyu.edu>
Message-ID:

Hello,

Here is the code you need to put in your submit filter/qsub wrapper:

#!/bin/bash
# Reject the submission if -I (interactive) appears among the qsub arguments.
args=("$@")
for ((arg = 0; arg < $#; arg++))
do
    if [ "${args[$arg]}" = "-I" ]
    then
        exit -1
    fi
done
# Otherwise pass the job script through unchanged from stdin to stdout.
while IFS= read -r i
do
    printf '%s\n' "$i"
done

This simple code makes sure that there won't be any interactive job requests from the command line, as it exits with a Torque message such as

qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice.

You need to keep this code in a file like submit.sh with these permissions:

ls -l submit.sh
-rwxr-xr-x 1 root root 45271 Mar 23 16:56 submit.sh

Then give the path to this file in /opt/torque/torque.cfg on the login node, or wherever users submit their jobs from:

[root at login-0-0 ~]# cat /opt/torque/torque.cfg
#SUBMITFILTER /share/apps/admins/torque/submit_current.sh
SUBMITFILTER /share/apps/admins/torque/submit.sh

I am not sure whether we can request interactive jobs from the script. But if we can, then I don't think it's that difficult to handle. Just work on each line, which is $i above, look for #PBS in each line, and if you find -I next to it just do exit -1. This should block all interactive jobs.

Sreedhar.

On Mar 16, 2012, at 12:33 PM, giggzounet wrote:

> Thx a lot!
>
> I had tested this "disallowed_types", but on a "Route" queue...And it
> seems to work only with "Execution" queue.
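The closing suggestion in the message above (also scan the script body for a `#PBS -I` directive) could be sketched as follows. This is an illustration, not code from the thread: `filter_job` is a hypothetical function name, and it only catches `-I` written as a separate option, both on the command line and in a directive.

```shell
#!/bin/bash
# Sketch: copy a job script from stdin to stdout, returning non-zero if an
# interactive job is requested. filter_job is a hypothetical name.
filter_job() {
    local arg line
    local directive_re='^#PBS[[:space:]]+-I([[:space:]]|$)'
    # Reject -I given among the qsub command-line arguments.
    for arg in "$@"; do
        if [ "$arg" = "-I" ]; then
            return 1
        fi
    done
    # Pass the script through line by line, rejecting a "#PBS -I" directive.
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line =~ $directive_re ]]; then
            return 1
        fi
        printf '%s\n' "$line"
    done
    return 0
}

# Demonstration on a two-line script that asks for an interactive job.
if printf '#!/bin/bash\n#PBS -I\n' | filter_job >/dev/null; then
    echo "accepted"
else
    echo "rejected"
fi
```

Installed as the SUBMITFILTER, the script would end with `filter_job "$@" || exit -1` in place of the demonstration lines.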
>
> Best regards,
> Guillaume
>
> On 16/03/2012 15:52, Sreedhar Manchu wrote:
>> qmgr -c 'set queue disallowed_types = interactive'
>>
>> replace queue name with the queue you want to disable interactive jobs for.
>>
>> Sreedhar.
>>
>>
>> On Mar 16, 2012, at 10:40 AM, giggzounet wrote:
>>
>>> Hi,
>>>
>>> Our university has a cluster with torque (3.0.2)/maui. We would like to
>>> disable the interactive queue or to restrict it in time. For example:
>>> - we would like to forbid "qsub -I".
>>> OR
>>> - we would like that "qsub -I" starts an interactive job for 30 minutes
>>> maximum.
>>>
>>> Is it possible ?
>>>
>>> Thx a lot,
>>> Best regards,
>>> Guillaume
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From gus at ldeo.columbia.edu Wed Mar 28 18:05:33 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 28 Mar 2012 20:05:33 -0400
Subject: [torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer
Message-ID: <4F73A74D.7000105@ldeo.columbia.edu>

Dear Torque Pros

I am having trouble with cpuset again, and would love to hear your suggestions.

I installed Torque 2.4.16 on a standalone machine running CentOS 6.2. Processors are AMD Opteron Bulldozer.

I have Torque 2.4.11 with cpuset working right on CentOS 5.2 + AMD Opteron Shanghai, and on CentOS 5.4 AMD Opteron Magny-Cours.

Anybody there using the same Torque and CentOS versions with Bulldozer and getting cpuset right?
********************************************** More info [sorry, lengthy, hopefully useful] ********************************************** The configure line looks like this: ../configure \ --prefix=${MYINSTALLDIR} \ --with-server-home=${MYINSTALLDIR} \ --enable-cpuset \ --enable-geometry-requests \ --with-pam \ 2>&1 | tee configure_${build_id}.log However, the test jobs flip forever between Q and R states, and never run. ********* syslog messages look like this [repeated forever]: Mar 28 19:21:36 galera pbs_mom: LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 1.galera.ldeo.columbia.edu. ******** server logs shows these errors [repeated ad nauseam until it exits]: 03/28/2012 19:21:44;0008;PBS_Server;Job;1.galera.ldeo.columbia.edu;Job Run at request of Scheduler at galera.ldeo.columbia.edu 03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command recyc 03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command new ********* mom_logs report errors like the ones below [also repeated many times]. Indeed the file /dev/cpuset/cpus that it cannot locate doesn't exist. It seems to have been renamed [in CentOS 6.2 perhaps?] to /dev/cpuset/cpuset.cpus . Actually, most file names there seem to have benefited from this wonderful prefix "cpuset." Oh, well, innovation has no bounds ... Would this filename mismatch be the problem? Any workaround or patch if this is really the reason for the failure? 
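Whether a host shows the mismatch described above can be checked without touching Torque. A minimal sketch, assuming only the two file names mentioned in this message; `cpuset_naming` is a hypothetical helper, not a Torque command:

```shell
#!/bin/bash
# Sketch: report which naming scheme a cpuset directory uses.
# cpuset_naming is a hypothetical helper, not a Torque command.
cpuset_naming() {
    local dir="${1:-/dev/cpuset}"
    if [ -e "$dir/cpus" ]; then
        echo "unprefixed"   # cpus: the name pbs_mom reports it cannot locate
    elif [ -e "$dir/cpuset.cpus" ]; then
        echo "prefixed"     # cpuset.cpus: the renamed file seen on CentOS 6.2
    else
        echo "absent"       # neither file found under the directory
    fi
}

cpuset_naming /dev/cpuset
```

A result of "prefixed" would match the situation in the mom_logs excerpt that follows.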
03/28/2012 19:09:57;0002; pbs_mom;Svr;Log;Log opened 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0 03/28/2012 19:09:57;0002; pbs_mom;Svr;setpbsserver;galera.ldeo.columbia.edu 03/28/2012 19:09:57;0002; pbs_mom;Svr;mom_server_add;server galera.ldeo.columbia.edu added 03/28/2012 19:09:57;0002; pbs_mom;Svr;usecp;*:/home /home 03/28/2012 19:09:57;0002; pbs_mom;Svr;usecp;*:/data00 /data00 03/28/2012 19:09:57;0002; pbs_mom;n/a;initialize;independent 03/28/2012 19:09:57;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs 03/28/2012 19:09:57;0002; pbs_mom;Svr;initialize_root_cpuset;Init TORQUE cpuset /dev/cpuset/torque. 03/28/2012 19:09:57;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::initialize_root_cpuset, cannot locate /dev/cpuset/cpus - cpusets not configured/enabled on host 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Is up 03/28/2012 19:09:57;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /data00/sw/torque/2.4.16/gnu-4.4.6/sbin/pbs_mom 1332200273 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0 03/28/2012 19:09:57;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server galera.ldeo.columbia.edu 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information) 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more 
information) 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information) 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server ********** Thank you, Gus Correa From roy.dragseth at cc.uit.no Tue Mar 27 15:59:24 2012 From: roy.dragseth at cc.uit.no (Roy Dragseth) Date: Tue, 27 Mar 2012 23:59:24 +0200 Subject: [torqueusers] problem with libtm under torque 4.0 Message-ID: <3301327.TcM7lWz6qi@lux> I have just installed torque 4.0 on my test cluster and there seems to be some issues with pbdsh and OSC mpiexec. Do anyone else have problems with these? I just want to check before I dive deeper into this. The problem I see is that if I run pbsdsh within a job $ pbsdsh -u uname -a pbsdsh: error from tm_poll() 17002 If I drop the -u flag it seems to work a bit better, but still get some error messages. 

$ pbsdsh uname -a
Linux compute-0-2.local 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
pbsdsh: Event poll failed, error TM_ENOTCONNECTED
Linux compute-0-2.local 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux compute-0-1.local 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux compute-0-1.local 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
pbsdsh: reconnected
pbsdsh: Event poll failed, error TM_ENOTFOUND

Also, pbs_mom tends to segfault when I try this. From dmesg:

pbs_mom[16801]: segfault at 0000000000000020 rip 000000000040ac36 rsp 00007fff32754f00 error 4

Does anyone else see anything similar? Torque v3.0.2 does not have this problem on the exact same setup.

This is on CentOS 5.8. Torque is compiled without hwloc and I have not configured any cpusets.

Regards,
r.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no

From dsimas at stanford.edu Fri Mar 30 16:28:06 2012
From: dsimas at stanford.edu (David Gabriel Simas)
Date: Fri, 30 Mar 2012 15:28:06 -0700 (PDT)
Subject: [torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer
In-Reply-To: <4F73A74D.7000105@ldeo.columbia.edu>
Message-ID: <1831500498.60711294.1333146486889.JavaMail.root@zm09.stanford.edu>

----- Original Message -----
> Dear Torque Pros
>
> I am having trouble with cpuset again,
> and would love to hear your suggestions.
>
> I installed Torque 2.4.16 on a standalone machine
> running CentOS 6.2. Processors are AMD Opteron Bulldozer.
>
> I have Torque 2.4.11 with cpuset working right on
> CentOS 5.2 + AMD Opteron Shanghai,
> and on CentOS 5.4 AMD Opteron Magny-Cours.
>
> Anybody there using the same Torque
> and CentOS versions with Bulldozer
> and getting cpuset right?
>
> **********************************************
> More info [sorry, lengthy, hopefully useful]
> **********************************************
>
> The configure line looks like this:
>
> ../configure \
> --prefix=${MYINSTALLDIR} \
> --with-server-home=${MYINSTALLDIR} \
> --enable-cpuset \
> --enable-geometry-requests \
> --with-pam \
> 2>&1 | tee configure_${build_id}.log
>
> However, the test jobs flip forever between Q and R states,
> and never run.
>
> *********
>
> syslog messages look like this [repeated forever]:
> Mar 28 19:21:36 galera pbs_mom: LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 1.galera.ldeo.columbia.edu.
>
> ********
>
> server logs show these errors [repeated ad nauseam until it exits]:
> 03/28/2012 19:21:44;0008;PBS_Server;Job;1.galera.ldeo.columbia.edu;Job Run at request of Scheduler at galera.ldeo.columbia.edu
> 03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command recyc
> 03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command new
>
> *********
>
> mom_logs report errors like the ones below [also repeated many times].
>
> Indeed the file /dev/cpuset/cpus
> that it cannot locate doesn't exist.
> It seems to have been renamed [in CentOS 6.2 perhaps?] to
> /dev/cpuset/cpuset.cpus .
> Actually, most file names there seem to have benefited from this
> wonderful prefix "cpuset."
> Oh, well, innovation has no bounds ...
>
> Would this filename mismatch be the problem?
> Any workaround or patch if this is really the reason for the failure?
>

A work-around I found in testing Torque 4.0.0 is:

umount /dev/cpuset
umount /sys/fs/cgroup/cpuset
mount -t cgroup -o cpuset,noprefix X /sys/fs/cgroup/cpuset

Then pbs_mom starts up and works with no errors.
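The work-around above can be wrapped in a small dry-run script. This is a sketch under assumptions, not a tested recipe: `remount_cpuset_noprefix` and the `APPLY` variable are hypothetical names, `X` is the placeholder device name from the message, and the cgroup subsystem option is spelled `cpuset`. By default the commands are only printed; they need root to run for real.

```shell
#!/bin/bash
# Sketch of the remount work-around as a dry run. By default the commands are
# only printed; set APPLY=1 and run as root to execute them.
# remount_cpuset_noprefix and APPLY are hypothetical names for this example.
remount_cpuset_noprefix() {
    local target="${1:-/sys/fs/cgroup/cpuset}"
    local run="echo"
    if [ "${APPLY:-0}" = "1" ]; then
        run=""
    fi
    $run umount /dev/cpuset
    $run umount "$target"
    # noprefix drops the "cpuset." prefix, so files appear as cpus, mems, ...
    $run mount -t cgroup -o cpuset,noprefix X "$target"
}

remount_cpuset_noprefix
```

Running it once without `APPLY` shows the three commands for review before anything is unmounted.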
However, torque doesn't seem to understand the difference between cores and hyperthreads. With cpusets enabled, a one processor job will be bound to a core, giving it two hyperthreads. DGS > > > 03/28/2012 19:09:57;0002; pbs_mom;Svr;Log;Log opened > 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = > 2.4.16, loglevel = 0 > 03/28/2012 19:09:57;0002; > pbs_mom;Svr;setpbsserver;galera.ldeo.columbia.edu > 03/28/2012 19:09:57;0002; pbs_mom;Svr;mom_server_add;server > galera.ldeo.columbia.edu added > 03/28/2012 19:09:57;0002; pbs_mom;Svr;usecp;*:/home /home > 03/28/2012 19:09:57;0002; pbs_mom;Svr;usecp;*:/data00 /data00 > 03/28/2012 19:09:57;0002; pbs_mom;n/a;initialize;independent > 03/28/2012 19:09:57;0080; pbs_mom;Svr;pbs_mom;before > init_abort_jobs > 03/28/2012 19:09:57;0002; pbs_mom;Svr;initialize_root_cpuset;Init > TORQUE cpuset /dev/cpuset/torque. > 03/28/2012 19:09:57;0001; > pbs_mom;Svr;pbs_mom;LOG_ERROR::initialize_root_cpuset, cannot locate > /dev/cpuset/cpus - cpusets not configured/enabled on host > 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Is up > 03/28/2012 19:09:57;0002; pbs_mom;Svr;setup_program_environment;MOM > executable path and mtime at launch: > /data00/sw/torque/2.4.16/gnu-4.4.6/sbin/pbs_mom 1332200273 > 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = > 2.4.16, loglevel = 0 > 03/28/2012 19:09:57;0002; > pbs_mom;n/a;mom_server_check_connection;sending hello to server > galera.ldeo.columbia.edu > 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not > started, Retry job exec failure, retry will be attempted (see syslog > for > more information) > 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to > sisters for job 1.galera.ldeo.columbia.edu > 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of > preobit_reply > 03/28/2012 19:11:14;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, > top > of while loop > 03/28/2012 19:11:14;0080; 
pbs_mom;Svr;preobit_reply;in while loop, > no > error from job stat > 03/28/2012 19:11:14;0080; > pbs_mom;Job;1.galera.ldeo.columbia.edu;obit > sent to server > 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not > started, Retry job exec failure, retry will be attempted (see syslog > for > more information) > 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to > sisters for job 1.galera.ldeo.columbia.edu > 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of > preobit_reply > 03/28/2012 19:11:14;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, > top > of while loop > 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, > no > error from job stat > 03/28/2012 19:11:14;0080; > pbs_mom;Job;1.galera.ldeo.columbia.edu;obit > sent to server > 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not > started, Retry job exec failure, retry will be attempted (see syslog > for > more information) > 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to > sisters for job 1.galera.ldeo.columbia.edu > 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of > preobit_reply > 03/28/2012 19:11:14;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, > top > of while loop > 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, > no > error from job stat > 03/28/2012 19:11:14;0080; > pbs_mom;Job;1.galera.ldeo.columbia.edu;obit > sent to server > > > ********** > > Thank you, > Gus Correa > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From jjc at iastate.edu Mon Mar 26 09:52:27 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Mon, 26 Mar 2012 15:52:27 +0000 Subject: [torqueusers] How to access torque via PHP on Linux In-Reply-To: References: Message-ID: 
<242421BFAF465844BE24EB90BB97E2210198EB9B@ITSDAG1D.its.iastate.edu>

How about using backticks or the shell_exec command in PHP to copy the script over and to issue an interactive qsub which waits for the output? For example:

$output = shell_exec('scp script remote_machine: ; ssh remote_machine "qsub -I -x script"');

You'll need to set up passwordless ssh for the account you are using on the remote machine; make it the same non-root account.

See: http://php.net/manual/en/function.shell-exec.php

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Fernando W. Bona
Sent: Thursday, March 22, 2012 8:47 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] How to access torque via PHP on Linux

Hi.

We are developing a website and trying to access the Torque with it. The website is being coded with PHP. We have a virtual machine running ubuntu and in it torque and PHP is running. Our objective is to send the same bash commands we do when using Torque by ourselves.

Can anyone help?

Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120326/e9e48bd8/attachment-0001.html

From stevenx.a.duchene at intel.com Fri Mar 30 10:59:22 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Fri, 30 Mar 2012 16:59:22 +0000
Subject: [torqueusers] torque 4.0 hostname resolution patch?
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805ABFF3E@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABFE4C@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABFED5@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABFF3E@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC0073@ORSMSX106.amr.corp.intel.com> Just to make sure it is not a problem with my network here I just tried a svn checkout from a completely different svn repository (redmine) svn co svn://rubyforge.org/var/svn/redmine/branches/1.3-stable redmine-1.3 and that worked just fine. What is up with the Torque repository site you provided? Do I have to be a torque developer to access it? -- Steven DuChene -----Original Message----- From: DuChene, StevenX A Sent: Thursday, March 29, 2012 4:18 PM To: Ken Nielson Subject: RE: torque 4.0 hostname resolution patch? I am trying this from an open network from my house and I keep getting: svn: Can't connect to host 'clusterresources.com': Connection timed out I tried with the IP address that corresponds to that name but the same thing happens. svn: Can't connect to host '204.15.87.226': Connection timed out Steven, The TORQUE sources us under subversion. So you can get the code using svn co svn://clusterresources.com/torque/branches/4.0-fixes Let me know if you have any problems with that URL. Ken ? From stevenx.a.duchene at intel.com Thu Mar 29 10:15:55 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Thu, 29 Mar 2012 16:15:55 +0000 Subject: [torqueusers] torque 4.0 hostname resolution patch? Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805ABFE4C@ORSMSX106.amr.corp.intel.com> Ken: Any progress or news on the patch you were working on for the localhost.localdomain address resolution and authorization stuff that you said might help with the torque to maui communication issue I am seeing? 
-- Steven DuChene -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120329/a67b6939/attachment.html From mej at lbl.gov Fri Mar 30 16:51:54 2012 From: mej at lbl.gov (Michael Jennings) Date: Fri, 30 Mar 2012 15:51:54 -0700 Subject: [torqueusers] torque 4.0 hostname resolution patch? In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0073@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABFE4C@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABFED5@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABFF3E@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0073@ORSMSX106.amr.corp.intel.com> Message-ID: <20120330225153.GN9750@lbl.gov> On Friday, 30 March 2012, at 16:59:22 (+0000), DuChene, StevenX A wrote: > What is up with the Torque repository site you provided? Steve: The SVN server appears to be down at the moment. I'm sure they'll fix it as soon as they can. Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From stevenx.a.duchene at intel.com Fri Mar 30 16:54:00 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Fri, 30 Mar 2012 22:54:00 +0000 Subject: [torqueusers] torque 4.0 hostname resolution patch? In-Reply-To: <20120330225153.GN9750@lbl.gov> References: <560DBE57F33C4C4C9FBF11C662951AF805ABFE4C@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABFED5@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805ABFF3E@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0073@ORSMSX106.amr.corp.intel.com> <20120330225153.GN9750@lbl.gov> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC01B5@ORSMSX106.amr.corp.intel.com> Thanks for the confirmation Michael. 
Are you the same Michael Jennings that worked at VA Linux Systems? -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Michael Jennings Sent: Friday, March 30, 2012 3:52 PM To: torqueusers at supercluster.org Subject: Re: [torqueusers] torque 4.0 hostname resolution patch? On Friday, 30 March 2012, at 16:59:22 (+0000), DuChene, StevenX A wrote: > What is up with the Torque repository site you provided? Steve: The SVN server appears to be down at the moment. I'm sure they'll fix it as soon as they can. Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From knielson at adaptivecomputing.com Thu Mar 29 15:47:59 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 29 Mar 2012 15:47:59 -0600 Subject: [torqueusers] Just a test Message-ID: The mail server seems to be down. This is testing to see if we are back up Ken Nielson -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120329/186e3ab6/attachment.html From l.flis at cyf-kr.edu.pl Tue Mar 27 16:59:34 2012 From: l.flis at cyf-kr.edu.pl (Lukasz Flis) Date: Wed, 28 Mar 2012 00:59:34 +0200 Subject: [torqueusers] Problem with TM interface when using --enable-numa Message-ID: <4F724656.2070205@cyf-kr.edu.pl> Hi It seems that TM interface in Torque 3.0.4 compiled with --enable-numa flag is broken. 
Example: qsub -I -l nodes=4:ppn=1 qsub: waiting for job 307.batch-xsmp to start qsub: job 307.batch-xsmp ready [@xsmp4-3-1 ~]$ cat $PBS_NODEFILE xsmp4-3-1.local xsmp4-2-4.local xsmp4-2-3.local xsmp4-1-2.local #mpiexec from openmpi compiled with TM support mpiexec uname -n xsmp4-3-1.local xsmp4-3-1.local xsmp4-3-1.local xsmp4-3-1.local The job above had been allocated 4 different nodes. However mpiexec or pbsdsh runs given command 4 times on the first of hosts from $PBS_NODE file Is this desired behaviour? I haven't tested Torque 4.0 with numa but I suspect it could have the same problem. Cheers -- LKF From giggzounet at gmail.com Wed Mar 28 07:13:12 2012 From: giggzounet at gmail.com (giggzounet) Date: Wed, 28 Mar 2012 15:13:12 +0200 Subject: [torqueusers] Torque 3.0.2: Pb with "Bad UID for job execution MSG=ruserok" on node (rsh/ssh questions) Message-ID: Hi, We have a cluster with torque 3.0.2 (and maui). We would like to be able to start job on nodes too. But a qsub on the node gives: Bad UID for job execution MSG=ruserok failed validating... We have no problem when we start a job from the frontend. The configuration is a classical one: - frontend: pbs_server (allow_node_submit = True) + maui are running - nodes: pbs_mom is running - we have a /etc/ssh/shosts.equiv file with all the nodes and frontend names. If I copy /etc/ssh/shosts.equiv to /etc/hosts.equiv the "bad uid" problem disappears. But I don't want rsh... The torque package was configured and installed by the firm which built the cluster. A few questions: - How can I know if torque is using rsh or ssh ? - Is there a solution to know the configure options used for our torque ? I mean for example "scp". I don't want rsh. - Is there a way to force torque to use ssh instead of rsh ? - If torque is using ssh, why it works with the /etc/hosts.equiv and not with /etc/ssh/shosts.equiv ? 
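A note on the last question: `ruserok` in the error text is the C library routine that consults /etc/hosts.equiv and ~/.rhosts, whereas /etc/ssh/shosts.equiv is read by sshd. That would be consistent with the error disappearing once the entries are copied to /etc/hosts.equiv, even though no rsh daemon is involved. A minimal sketch for checking which of the two files lists a given host follows. The helper `host_in_equiv` and the host name `node01` are hypothetical, and the check only compares the first field of each line; it does not implement the `+`/`-` or netgroup syntax that hosts.equiv allows.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): report whether HOST appears as
# the first field of a hosts.equiv-style file.
# Returns 0 if listed, 1 if not listed, 2 if the file cannot be read.
host_in_equiv() {
    equiv_file=$1
    wanted_host=$2
    [ -r "$equiv_file" ] || return 2
    # Comment lines start with "#", so their first field never equals a host name.
    awk -v h="$wanted_host" '$1 == h { found = 1 } END { exit !found }' "$equiv_file"
}

# Compare the two files mentioned above for one node name.
# "node01" is a placeholder; substitute a real node or frontend name.
for f in /etc/hosts.equiv /etc/ssh/shosts.equiv; do
    if host_in_equiv "$f" node01; then
        echo "$f: node01 listed"
    else
        echo "$f: node01 not listed (or file unreadable)"
    fi
done
```

If the host shows up only in shosts.equiv, the server-side check has nothing to match against, which fits the behaviour described above.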
Thx a lot, best regards, Guillaume From Gareth.Williams at csiro.au Fri Mar 30 19:18:11 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Sat, 31 Mar 2012 12:18:11 +1100 Subject: [torqueusers] Problem with TM interface when using --enable-numa In-Reply-To: <4F724656.2070205@cyf-kr.edu.pl> References: <4F724656.2070205@cyf-kr.edu.pl> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102DCC742D7@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Lukasz Flis [mailto:l.flis at cyf-kr.edu.pl] > Sent: Wednesday, 28 March 2012 10:00 AM > To: Torque Developers mailing list; Torque Users Mailing List > Subject: [torqueusers] Problem with TM interface when using --enable- > numa > > Hi > > It seems that TM interface in Torque 3.0.4 compiled with --enable-numa > flag is broken. > > Example: > > qsub -I -l nodes=4:ppn=1 > qsub: waiting for job 307.batch-xsmp to start > qsub: job 307.batch-xsmp ready > > > [@xsmp4-3-1 ~]$ cat $PBS_NODEFILE > xsmp4-3-1.local > xsmp4-2-4.local > xsmp4-2-3.local > xsmp4-1-2.local > > #mpiexec from openmpi compiled with TM support > mpiexec uname -n > xsmp4-3-1.local > xsmp4-3-1.local > xsmp4-3-1.local > xsmp4-3-1.local > > > The job above had been allocated 4 different nodes. > However mpiexec or pbsdsh runs given command 4 times on the first of > hosts from $PBS_NODE file > > Is this desired behaviour? I haven't tested Torque 4.0 with numa but I > suspect it could have the same problem. > > Cheers > -- > LKF I see different behaviour with our 3.0.4-snap.201201051014 numa enabled setup and I think I see a/the problem/difference. Our numa setup has a single host - cherax, with a set of logical numa-nodes cherax-0, cherax-1 ... In jobs the $PBS_NODEFILE gets populated with the actual hostname - cherax, not the logical numa-node names and pbsdsh works fine afaik though I've not checked if launched processes are allocated to appropriate cores (I'd not necessarily expect that anyway). 
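One way to tell which of the two behaviours a given setup shows is to compare the hosts in $PBS_NODEFILE with the hosts on which launched processes actually report running. The sketch below is only an illustration: the helper name `missing_hosts` is hypothetical, and the in-job part reuses the `mpiexec uname -n` command from the original report, so it assumes an mpiexec built with TM support.

```shell
#!/bin/sh
# Hypothetical helper: print hosts that appear in the node file ($1) but
# not in the file of observed host names ($2), one per line.
missing_hosts() {
    nodefile=$1
    observed=$2
    sorted_nodes=$(mktemp) || return 1
    sorted_seen=$(mktemp) || return 1
    sort -u "$nodefile" > "$sorted_nodes"
    sort -u "$observed" > "$sorted_seen"
    # comm -23 keeps lines that occur only in the first (sorted) file.
    comm -23 "$sorted_nodes" "$sorted_seen"
    rm -f "$sorted_nodes" "$sorted_seen"
}

# Inside a job only: collect the host name reported by each launched process
# and list allocated hosts that never ran anything.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpiexec >/dev/null 2>&1; then
    seen=$(mktemp)
    mpiexec uname -n > "$seen"
    missing_hosts "$PBS_NODEFILE" "$seen"
    rm -f "$seen"
fi
```

With the output quoted above, this would list the three allocated hosts that never ran a process; on a single-host numa setup it should print nothing.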
It looks like you actually have a multi-node system, and I had feedback some time ago that I couldn't run a numa-enabled torque on such a system (yet). Your 'uname -n' output seems to indicate you are not using/getting a numa setup.

Gareth

From avb at ssau.ru Fri Mar 30 23:18:33 2012
From: avb at ssau.ru (Alexandr Baskakov)
Date: Sat, 31 Mar 2012 09:18:33 +0400
Subject: [torqueusers] Can't submit job from remote submit host
Message-ID: <4F7693A9.9080501@ssau.ru>

Hi, All.

I am trying to submit a job from a submit host to a remote server with torque.

I have 2 nodes:
mgt1 - torque client
mgt2 - torque server and moab.
Domain: ssc

On mgt2:

[mgt2 ~]$ qmgr -c 'l s'
Server mgt2
server_state = Active
scheduling = True
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
acl_hosts = localhost,mgt2,mgt1
managers = root at mgt2,torque at mgt2
operators = root at mgt2,torque at mgt2
default_queue = batch
log_events = 511
mail_from = adm
query_other_jobs = True
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
log_level = 7
mom_job_sync = True
pbs_version = 3.0.2
keep_completed = 300
submit_hosts = mgt1.ssc
next_job_number = 57
net_counter = 2 0 0

When I try to submit a job from mgt1, I get an error:

[mgt1 ~]$ PBS_DEFAULT=mgt2 qsub hostname
qsub: Bad UID for job execution MSG=ruserok failed validating avb/avb from mgt1
On mgt2, in the log file:

03/26/2012 15:52:57;0080;PBS_Server;Req;dis_request_read;decoding command AuthenticateUser from avb
03/26/2012 15:52:57;0100;PBS_Server;Req;;Type AuthenticateUser request received from avb at mgt1.ssc, sock=14
03/26/2012 15:52:57;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=14
03/26/2012 15:52:57;0008;PBS_Server;Job;reply_send;Reply sent for request type AuthenticateUser on socket 14
03/26/2012 15:52:57;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from PBS_Server
03/26/2012 15:52:57;0080;PBS_Server;Req;dis_request_read;decoding command QueueJob from avb
03/26/2012 15:52:57;0100;PBS_Server;Req;;Type QueueJob request received from avb at mgt1.ssc, sock=13
03/26/2012 15:52:57;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=13
03/26/2012 15:52:57;0080;PBS_Server;Job;62.mgt2;removed job file
03/26/2012 15:52:57;0080;PBS_Server;Req;req_reject;Reject reply code=15025(Bad UID for job execution MSG=ruserok failed validating avb/avb from mgt1), aux=0, type=QueueJob, from avb at mgt1.ssc
03/26/2012 15:52:57;0008;PBS_Server;Job;reply_send;Reply sent for request type QueueJob on socket 13

Authentication on mgt1 and mgt2 is done via nss_ldap. Logging in to mgt2 as user avb works fine.

Can anyone help, please?

--
Alexandr Baskakov, Samara State Aerospace University
e-mail: avb at ssau.ru

From ianm at uchicago.edu Mon Mar 26 15:35:43 2012
From: ianm at uchicago.edu (Ian Miller)
Date: Mon, 26 Mar 2012 21:35:43 +0000
Subject: [torqueusers] Simple Q. about controlling CPU utilization per user
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE45@exvic-mbx04.nexus.csiro.au>
Message-ID:

Hi All,

Is there a simple switch or config edit to curb the CPU utilization per job submitted in torque? I'm running 3.0.3.

Thx
-I

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120326/baf1bd94/attachment-0001.html From mike.d.stevens at gmail.com Wed Mar 28 15:11:57 2012 From: mike.d.stevens at gmail.com (mike.d.stevens at gmail.com) Date: Wed, 28 Mar 2012 21:11:57 +0000 Subject: [torqueusers] pbs_sched cores Message-ID: <047d7b339ccfc1a6bf04bc540ef8@google.com> I am running a 115 node cluster using torque 2.5.7 under CentOS 6.2. This cluster is in turn running on a Vmware ESX 4.0 cluster; the idea here being that we can use the physical resources of the torque cluster when no jobs are running. I am seeing crashes of pbs_sched when the cluster gets busy, which seem to be more pronounced when the network is busy. Following is some data I've been able to assemble thus far: /var/log/messages Mar 28 09:44:41 node25 pbs_mom: LOG_ERROR::Broken pipe (32) in rm_request, write request response failed: Protocol failure in commit#012#011message refused from port 1022 addr 10.80.101.10 Mar 28 09:44:51 node27 pbs_mom: LOG_ERROR::Broken pipe (32) in rm_request, write request response failed: Protocol failure in commit#012#011message refused from port 1022 addr 10.80.101.10 Mar 28 09:47:05 cluster1 kernel: pbs_sched[15017]: segfault at 0 ip 0000003ff4a13c44 sp 00007fff58347c90 error 4 in libtorque.so.2.0.0[3ff4a00000+2d000] Mar 28 09:47:05 cluster1 abrt[23193]: saved core dump of pid 15017 (/usr/sbin/pbs_sched) to /var/spool/abrt/ccpp-2012-03-28-09:47:05-15017.new/coredump (3407872 bytes) Mar 28 09:47:05 cluster1 abrtd: Directory 'ccpp-2012-03-28-09:47:05-15017' creation detected Mar 28 09:47:05 cluster1 abrtd: Package 'torque-scheduler' isn't signed with proper key Mar 28 09:47:05 cluster1 abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2012-03-28-09:47:05-15017 (res:2), deleting sched_log 03/28/2012 09:46:55;0040; pbs_sched;Job;267980.cluster1.cluster.affymetrix.com;Job Run 03/28/2012 09:46:55;0040; pbs_sched;Job;267981.cluster1.cluster.affymetrix.com;Job Run 03/28/2012 
09:47:00;0040; pbs_sched;Job;267982.cluster1.cluster.affymetrix.com;Job Run 03/28/2012 09:47:05;0040; pbs_sched;Job;267983.cluster1.cluster.affymetrix.com;Job Run 03/28/2012 09:53:32;0002; pbs_sched;Svr;Log;Log opened 03/28/2012 09:53:32;0002; pbs_sched;Svr;TokenAct;Account file /var/lib/torque/sched_priv/accounting/20120328 opened 03/28/2012 09:53:32;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 23588 03/28/2012 09:53:33;0040; pbs_sched;Job;267984.cluster1.cluster.affymetrix.com;Job Run 03/28/2012 09:53:34;0040; pbs_sched;Job;267985.cluster1.cluster.affymetrix.com;Job Run gdb of core file [root at cluster1 sched_priv]# gdb -e /usr/sbin/pbs_sched -c core.15017 GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: . [New Thread 15017] Missing separate debuginfo for Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/23/1bd9599ad974226f19adfdc4dae3691396c81d Reading symbols from /usr/lib64/libtorque.so.2.0.0...Reading symbols from /usr/lib/debug/usr/lib64/libtorque.so.2.0.0.debug...done. done. Loaded symbols for /usr/lib64/libtorque.so.2.0.0 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_files.so.2 Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done. 
Loaded symbols for /lib64/libnss_dns.so.2 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Core was generated by `/usr/sbin/pbs_sched -d /var/lib/torque -a 600'. Program terminated with signal 11, Segmentation fault. #0 0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=, num_resc=, available=0x7fff58347d0c, allocated=0x7fff58347d08, reserved=0x7fff58347d04, down=0x7fff58347d00) at ../Libifl/pbsD_resc.c:215 215 *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i); Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64 (gdb) bt #0 0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=, num_resc=, available=0x7fff58347d0c, allocated=0x7fff58347d08, reserved=0x7fff58347d04, down=0x7fff58347d00) at ../Libifl/pbsD_resc.c:215 #1 0x000000000040c8d6 in ?? () #2 0x00007fff58347d00 in ?? () #3 0x0000000000000000 in ?? () (gdb) Does anyone have any ideas as to what is wrong here? I'd be happy to provide additional information. -- Mike Stevens -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120328/73791b1f/attachment.html From sjf4 at uw.edu Wed Mar 28 16:41:25 2012 From: sjf4 at uw.edu (Stephen Fralich) Date: Wed, 28 Mar 2012 22:41:25 +0000 Subject: [torqueusers] pbs_mom communication problem causes job deletion Message-ID: I'm running Torque 2.5.10 and Moab 6.0.8. Every few weeks or every few months, I experience a period of job deletions which seem to be caused by some communication difficulty between pbs_mom and pbs_server. A job that's been running and preempted successfully 10s of times over the course of days, is preempted and starts running again. There's some miscommunication between pbs_server and pbs_mom. pbs_server deletes the job (immediately after the delete, qstat returns unknown job), pbs_mom receives the job deletion, but the job keeps running anyway. 
The most inconvenient part of this is that Moab doesn't understand what has happened and waits 15 minutes for the job start to return.

I have some reason to believe this is caused by stdout files larger than 1MB or so, but that is not always the case. In most cases, however, it seems that the transfer of the stdout file from pbs_server to pbs_mom causes some necessary subsequent connection to time out.

You can find level 10 pbs_mom logs and level 7 pbs_server logs at the URL below. Job 527014 has a small stdout file (28KB) and fails in the same way as job 526003, which has a large stdout file (several MB). The mom logs cover the job start. The server logs cover the period from the preemption to the end of the job start.

http://staff.washington.edu/sjf4/uw_20120328/

I'd greatly appreciate it if anyone with any information would speak up.

Thanks,
Stephen
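To check whether the affected jobs line up with large stdout files, the spooled output on a node can be listed by size. The following is only a sketch under stated assumptions: the helper name `large_stdout_files` is hypothetical, the `.OU` suffix for spooled stdout and the spool path `/var/spool/torque/spool` are assumed and may differ per installation, and the 1 MiB cutoff simply mirrors the "1MB or so" estimate above.

```shell
#!/bin/sh
# Hypothetical helper: list spooled job stdout files (assumed to end in .OU)
# larger than 1 MiB under the given directory.
large_stdout_files() {
    spool_dir=$1
    # -size counts 512-byte blocks, so +2048 means "more than 1 MiB".
    find "$spool_dir" -type f -name '*.OU' -size +2048
}

# Assumed default location of the pbs_mom spool directory; adjust as needed.
spool=/var/spool/torque/spool
if [ -d "$spool" ]; then
    large_stdout_files "$spool"
fi
```

Cross-referencing the job IDs in the listed file names against the jobs that were deleted would show how often the size correlation actually holds.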