From gas5x at yahoo.com Fri Jun 1 08:37:04 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Fri, 1 Jun 2012 07:37:04 -0700 (PDT)
Subject: [torqueusers] checkpointable jobs lose environment variables?
In-Reply-To: <1338496081.8869.YahooMailClassic@web111313.mail.gq1.yahoo.com>
Message-ID: <1338561424.28101.YahooMailClassic@web111312.mail.gq1.yahoo.com>

Actually, if I switch to tcsh, for both the job script itself and #PBS -S, the environment gets passed somehow. But in the case of bash it doesn't!

--
Grigory Shamov

--- On Thu, 5/31/12, Grigory Shamov wrote:

> From: Grigory Shamov
> Subject: [torqueusers] checkpointable jobs lose environment variables?
> To: torqueusers at supercluster.org
> Date: Thursday, May 31, 2012, 1:28 PM
> Hi All,
>
> I have tried to install BLCR checkpoint/restart (0.8.4)-enabled
> Torque (2.5.11) on a few old CentOS 5 machines we have
> (kernels 2.6.18.308, 2.6.18.194). I built Torque with the
> --enable-blcr switch, and BLCR was installed as a system RPM
> (to /usr/bin etc.).
>
> The simple seconds-counting test seems to work. However, a
> user application test failed, the reason being inaccessible
> environment modules. I checked with the 'env' command and
> found that while normal 'qsub' passes all the environment,
> 'qsub -c' does not.
>
> The job script was really minimal.
>
> #!/bin/bash
> #PBS -N test
> #PBS -l procs=2,walltime=21:10,mem=2mb
> #PBS -r y
> #PBS -S /bin/bash
> #
>
> env
>
> cd $PBS_O_WORKDIR
> ./test.x
> # done
>
> The results of 'env' differ: for 'qsub -c' almost only the
> $PBS_* variables are passed, while for plain 'qsub' everything
> is there.
>
> Could you please tell me whether this is the desired behaviour or a
> bug, or whether there is a way to pass the environment explicitly
> for 'qsub -c'?
>
> Thank you very much!
>
> --
> Grigory Shamov
> HPC Analyst,
> University of Manitoba
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From sm4082 at nyu.edu Fri Jun 1 08:54:06 2012
From: sm4082 at nyu.edu (Sreedhar M)
Date: Fri, 1 Jun 2012 10:54:06 -0400
Subject: [torqueusers] checkpointable jobs lose environment variables?
In-Reply-To: <1338561424.28101.YahooMailClassic@web111312.mail.gq1.yahoo.com>
References: <1338561424.28101.YahooMailClassic@web111312.mail.gq1.yahoo.com>
Message-ID: <22F28F06-7170-45CE-82D2-F42DC67D1866@nyu.edu>

Hi Grigory,

Can you try adding

#PBS -V

to your script? Not sure whether it'd fix it, but it's worth giving it a try.

Sreedhar Manchu
HPC Support Specialist
ITS-Esystems/Research Services
New York University, NY - 100012

On Jun 1, 2012, at 10:37 AM, Grigory Shamov wrote:

> Actually, if I switch to tcsh, for both the job script itself and #PBS -S, the environment gets passed somehow. But in the case of bash it doesn't!
>
> --
> Grigory Shamov
>
> --- On Thu, 5/31/12, Grigory Shamov wrote:
>
>> From: Grigory Shamov
>> Subject: [torqueusers] checkpointable jobs lose environment variables?
>> To: torqueusers at supercluster.org
>> Date: Thursday, May 31, 2012, 1:28 PM
>> Hi All,
>>
>> I have tried to install BLCR checkpoint/restart (0.8.4)-enabled
>> Torque (2.5.11) on a few old CentOS 5 machines we have
>> (kernels 2.6.18.308, 2.6.18.194). I built Torque with the
>> --enable-blcr switch, and BLCR was installed as a system RPM
>> (to /usr/bin etc.).
>>
>> The simple seconds-counting test seems to work. However, a
>> user application test failed, the reason being inaccessible
>> environment modules. I checked with the 'env' command and
>> found that while normal 'qsub' passes all the environment,
>> 'qsub -c' does not.
>>
>> The job script was really minimal.
>>
>> #!/bin/bash
>> #PBS -N test
>> #PBS -l procs=2,walltime=21:10,mem=2mb
>> #PBS -r y
>> #PBS -S /bin/bash
>> #
>>
>> env
>>
>> cd $PBS_O_WORKDIR
>> ./test.x
>> # done
>>
>> The results of 'env' differ: for 'qsub -c' almost only the
>> $PBS_* variables are passed, while for plain 'qsub' everything
>> is there.
>>
>> Could you please tell me whether this is the desired behaviour or a
>> bug, or whether there is a way to pass the environment explicitly
>> for 'qsub -c'?
>>
>> Thank you very much!
>>
>> --
>> Grigory Shamov
>> HPC Analyst,
>> University of Manitoba

From gas5x at yahoo.com Fri Jun 1 11:21:47 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Fri, 1 Jun 2012 10:21:47 -0700 (PDT)
Subject: [torqueusers] checkpointable jobs lose environment variables?
In-Reply-To: <22F28F06-7170-45CE-82D2-F42DC67D1866@nyu.edu>
Message-ID: <1338571307.79053.YahooMailClassic@web111301.mail.gq1.yahoo.com>

Dear Sreedhar,

Thank you! It does fix the problem for bash. Although the environments are not yet completely identical for 'qsub' and 'qsub -c', considerably more is passed through now.

Still, it is somewhat confusing why this happens.

--
Grigory Shamov

--- On Fri, 6/1/12, Sreedhar M wrote:

> From: Sreedhar M
> Subject: Re: [torqueusers] checkpointable jobs lose environment variables?
> To: "Torque Users Mailing List"
> Date: Friday, June 1, 2012, 7:54 AM
> Hi Grigory,
>
> Can you try adding
>
> #PBS -V
>
> to your script? Not sure whether it'd fix it, but it's worth
> giving it a try.
From gas5x at yahoo.com Fri Jun 1 11:57:01 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Fri, 1 Jun 2012 10:57:01 -0700 (PDT)
Subject: [torqueusers] cpusets in Torque 3.0, memory pressure
Message-ID: <1338573421.29117.YahooMailClassic@web111314.mail.gq1.yahoo.com>

Hi,

The Torque 3 changelog says "when cpusets are configured and memory pressure enabled, add the ability to check memory pressure for a job. Using $memory_pressure_threshold and $memory_pressure_duration in the mom's config, the admin sets a threshold at which a job becomes a problem".

Does it work for a non-NUMA setup as well? I mean, if I configure it with --enable-cpusets only, but with --disable-numa-support?

--
Grigory Shamov
HPC Analyst,
University of Manitoba

From kunalgrao at gmail.com Fri Jun 1 13:57:22 2012
From: kunalgrao at gmail.com (Kunal Rao)
Date: Fri, 1 Jun 2012 15:57:22 -0400
Subject: [torqueusers] Fwd: Multi-req job not starting
In-Reply-To:
References:
Message-ID:

I removed NODEALLOCATIONPOLICY and tried again. This time the job started, but the node allocation was not as expected.

The job needs 1 node with 2 procs and 3 nodes with 1 proc each. The allocation was done on only 3 nodes: 2 with 2 procs and 1 with 1 proc. Not sure if this is a bug or a conflict in the configuration.
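One way to make the mismatch described above easy to see is to count how many processor slots landed on each host. The sketch below assumes the allocation is available as a Torque-style exec_host string (host/slot entries joined by '+', as shown by qstat -f). The function name summarize_alloc and the host names nodeA, nodeB and nodeC are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: count processor slots per host in an exec_host-style string.
# summarize_alloc and the host names below are hypothetical.
summarize_alloc() {
    # Split "host/slot+host/slot+..." into one entry per line,
    # strip spaces, count entries per host, and sort by host name.
    printf '%s\n' "$1" |
        tr -d ' ' |
        tr '+' '\n' |
        awk -F/ 'NF { n[$1]++ } END { for (h in n) print h, n[h] }' |
        sort
}

# An allocation of the shape reported above: two hosts with two
# slots each and one host with a single slot.
summarize_alloc "nodeA/0+nodeA/1+nodeB/0+nodeB/1+nodeC/0"
```

For a request of nodes=1:ppn=2+3, the expected summary would instead show one host with 2 slots and three hosts with 1 slot each.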
My current additional configurations are:

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

ENABLEMULTIREQJOBS    TRUE
JOBNODEMATCHPOLICY    EXACTNODE
NODEACCESSPOLICY      SINGLEJOB

I also tried this, with the same result:

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

ENABLEMULTIREQJOBS    TRUE
NODEALLOCATIONPOLICY  PRIORITY
NODECFG[DEFAULT]      PRIORITYF='APROCS'
JOBNODEMATCHPOLICY    EXACTNODE
NODEACCESSPOLICY      SINGLEJOB

Any suggestions?

Thanks,
Kunal

On Thu, May 31, 2012 at 10:26 PM, Kunal Rao wrote:

> I need NODEACCESSPOLICY; maybe I'll remove NODEALLOCATIONPOLICY and check
> tomorrow.
>
> Thanks,
> Kunal
>
> On Thu, May 31, 2012 at 10:23 PM, Ju JiaJia wrote:
>
>> Everything seems OK. You could try deleting the additional
>> configuration in maui.cfg, such as NODEALLOCATIONPOLICY and
>> NODEACCESSPOLICY, or using the defaults or other options.
>>
>> On Fri, Jun 1, 2012 at 9:59 AM, Kunal Rao wrote:
>>
>>> Each node has 16 cores. The TORQUE_HOME/server_priv/nodes file has,
>>> for each of the 10 nodes:
>>>
>>> np=16 gpus=1
>>>
>>> Thanks,
>>> Kunal
>>>
>>> On Thu, May 31, 2012 at 9:54 PM, Ju JiaJia wrote:
>>>
>>>> How many cores are on each of the 10 nodes? I ask because you are
>>>> trying to allocate 2 processors on one node. And how did you
>>>> configure TORQUE_HOME/server_priv/nodes?
>>>>
>>>> On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao wrote:
>>>>
>>>>> Queue / Server configuration:
>>>>>
>>>>> ---------------
>>>>>
>>>>> qmgr -c 'p s'
>>>>> #
>>>>> # Create queues and set their attributes.
>>>>> #
>>>>> #
>>>>> # Create and define queue batch
>>>>> #
>>>>> create queue batch
>>>>> set queue batch queue_type = Execution
>>>>> set queue batch resources_default.nodes = 1
>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>> set queue batch enabled = True
>>>>> set queue batch started = True
>>>>> #
>>>>> # Set server attributes.
>>>>> #
>>>>> set server scheduling = True
>>>>> set server acl_hosts = fire16
>>>>> set server acl_roots = root at fire16.csa.local
>>>>> set server managers = root at fire16.csa.local
>>>>> set server operators = root at fire16.csa.local
>>>>> set server default_queue = batch
>>>>> set server log_events = 511
>>>>> set server mail_from = adm
>>>>> set server scheduler_iteration = 20
>>>>> set server node_check_rate = 150
>>>>> set server tcp_timeout = 6
>>>>> set server mom_job_sync = True
>>>>> set server keep_completed = 300
>>>>> set server allow_node_submit = True
>>>>> set server next_job_number = 6331
>>>>>
>>>>> ---------------
>>>>>
>>>>> Job resource requirement:
>>>>>
>>>>> ---------
>>>>>
>>>>> #PBS -l nodes=1:ppn=2+3,walltime=0:05:00
>>>>>
>>>>> ---------
>>>>>
>>>>> "pbsnodes -a" shows all 10 nodes in the "free" state, so they are
>>>>> all accessible.
>>>>>
>>>>> Thanks,
>>>>> Kunal
>>>>>
>>>>> On 5/31/12, Ju JiaJia wrote:
>>>>> > Please give your queue/server configuration and your job's
>>>>> > resource needs (cpu, memory, etc.). And are all 10 nodes
>>>>> > accessible? You can use pbsnodes to check this.
>>>>> >
>>>>> > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao wrote:
>>>>> >
>>>>> >> Hello,
>>>>> >>
>>>>> >> Please see the message below. I had posted it on the maui users
>>>>> >> mailing list but did not get any response, so I thought of
>>>>> >> posting it here on the torque users mailing list (in case
>>>>> >> someone knows). Kindly let me know if you have any comments /
>>>>> >> ideas / suggestions.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Kunal
>>>>> >>
>>>>> >> ---------- Forwarded message ----------
>>>>> >> From: Kunal Rao
>>>>> >> Date: Wed, May 23, 2012 at 2:30 PM
>>>>> >> Subject: Re: Multi-req job not starting
>>>>> >> To: mauiusers at supercluster.org
>>>>> >>
>>>>> >> There was a similar post earlier:
>>>>> >>
>>>>> >> http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html
>>>>> >>
>>>>> >> But I did not find any response to it. Can anyone please provide
>>>>> >> some ideas / suggestions on this issue?
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Kunal
>>>>> >>
>>>>> >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao wrote:
>>>>> >>
>>>>> >>> Hello,
>>>>> >>>
>>>>> >>> I have a 10 node cluster. There are 3 jobs: one which needs 2
>>>>> >>> nodes (with 1 task per node), another which needs 4 nodes (with
>>>>> >>> 1 task per node), and a third which needs 4 nodes (with 2 tasks
>>>>> >>> on 1 node and 1 task each on the other 3 nodes).
>>>>> >>>
>>>>> >>> Additional configuration in maui.cfg is:
>>>>> >>>
>>>>> >>> BACKFILLPOLICY        FIRSTFIT
>>>>> >>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>> >>>
>>>>> >>> ENABLEMULTIREQJOBS    TRUE
>>>>> >>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>> >>> NODEACCESSPOLICY      SINGLEJOB
>>>>> >>> JOBNODEMATCHPOLICY    EXACTNODE
>>>>> >>>
>>>>> >>> I am observing that if the first 2 jobs are running, the third
>>>>> >>> one does not start (even though 4 nodes are available) until
>>>>> >>> one of the jobs completes.
>>>>> >>> With checkjob -v it shows the following output:
>>>>> >>>
>>>>> >>> ------------------
>>>>> >>>
>>>>> >>> checking job 5791 (RM job '5791.fire16.csa.local')
>>>>> >>>
>>>>> >>> State: Idle
>>>>> >>> Creds: user:kunal group:kunal class:batch qos:DEFAULT
>>>>> >>> WallTime: 00:00:00 of 00:04:51
>>>>> >>> SubmitTime: Wed May 23 11:52:04
>>>>> >>> (Time Queued Total: 00:48:52 Eligible: 00:48:52)
>>>>> >>>
>>>>> >>> StartDate: 00:00:01 Wed May 23 12:40:57
>>>>> >>> Total Tasks: 2
>>>>> >>>
>>>>> >>> Req[0] TaskCount: 2 Partition: ALL
>>>>> >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>>>>> >>> Opsys: [NONE] Arch: [NONE] Features: [NONE]
>>>>> >>> Exec: '' ExecSize: 0 ImageSize: 0
>>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>>> >>> NodeAccess: SINGLEJOB
>>>>> >>> TasksPerNode: 2 NodeCount: 1
>>>>> >>>
>>>>> >>> Req[1] TaskCount: 3 Partition: ALL
>>>>> >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>>>>> >>> Opsys: [NONE] Arch: [NONE] Features: [NONE]
>>>>> >>> Exec: '' ExecSize: 0 ImageSize: 0
>>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>>> >>> NodeAccess: SINGLEJOB
>>>>> >>> NodeCount: 3
>>>>> >>>
>>>>> >>> IWD: [NONE] Executable: [NONE]
>>>>> >>> Bypass: 5 StartCount: 0
>>>>> >>> PartitionMask: [ALL]
>>>>> >>> Flags: RESTARTABLE
>>>>> >>>
>>>>> >>> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51)
>>>>> >>> PE: 5.00 StartPriority: 48
>>>>> >>> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01')
>>>>> >>>
>>>>> >>> ------------
>>>>> >>>
>>>>> >>> What could be the reason for not starting this job? How do I
>>>>> >>> resolve this?
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> Kunal

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120601/914c1b38/attachment-0001.html

From dbeer at adaptivecomputing.com Fri Jun 1 14:07:22 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 1 Jun 2012 14:07:22 -0600
Subject: [torqueusers] cpusets in Torque 3.0, memory pressure
In-Reply-To: <1338573421.29117.YahooMailClassic@web111314.mail.gq1.yahoo.com>
References: <1338573421.29117.YahooMailClassic@web111314.mail.gq1.yahoo.com>
Message-ID:

On Fri, Jun 1, 2012 at 11:57 AM, Grigory Shamov wrote:

> Hi,
>
> The Torque 3 changelog says "when cpusets are configured and memory
> pressure enabled, add the ability to check memory pressure for a job. Using
> $memory_pressure_threshold and $memory_pressure_duration in the mom's
> config, the admin sets a threshold at which a job becomes a problem".
>
> Does it work for a non-NUMA setup as well? I mean, if I configure it with
> --enable-cpusets only, but with --disable-numa-support?
>

Grigory,

As long as you set your cpuset up correctly (http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/a.cmomconfig.php) it will work with or without NUMA support.

David

> --
> Grigory Shamov
> HPC Analyst,
> University of Manitoba

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120601/3c105ece/attachment.html

From kunalgrao at gmail.com Fri Jun 1 14:34:09 2012
From: kunalgrao at gmail.com (Kunal Rao)
Date: Fri, 1 Jun 2012 16:34:09 -0400
Subject: [torqueusers] Fwd: Multi-req job not starting
In-Reply-To:
References:
Message-ID:

Found this post online:
http://www.supercluster.org/pipermail/mauiusers/2010-February/004116.html

I also have JOBNODEMATCHPOLICY EXACTNODE and NODEACCESSPOLICY SINGLEJOB set in the configuration. Could this bug still be there in Maui?

I tested with a smaller cluster; let me explain the scenario again.

This time I have a 6 node cluster with Torque-3.0.3 and Maui running.
Additional configuration in my Maui configuration file:

----------

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

ENABLEMULTIREQJOBS    TRUE
JOBNODEMATCHPOLICY    EXACTNODE
NODEACCESSPOLICY      SINGLEJOB

----------

Now I submit a 2 node job with the following resource requirement:

----------

#PBS -l nodes=2,walltime=0:10:00

---------

This job starts on node1/0 + node2/0.

Now I submit another, 4 node job with the following resource requirement:

---------

#PBS -l nodes=1:ppn=2+3,walltime=0:05:00

--------

This job is also started, but with the following resources:

*node3/0 + node3/1* + *node4/0 + node4/1* + node5/0

I would expect this job to use the resources as follows:

node3/0 + node3/1 + node4/0 + node5/0 + node6/0

But it did not use node6 at all. Instead it put 2 procs on each of node3 and node4 and one more proc on node5; node6 remained idle.

Is this a bug, or is some other configuration / setting required?

Thanks,
Kunal

On Fri, Jun 1, 2012 at 3:57 PM, Kunal Rao wrote:

> I removed NODEALLOCATIONPOLICY and tried again. This time the job
> started, but the node allocation was not as expected.
>
> The job needs 1 node with 2 procs and 3 nodes with 1 proc each. The
> allocation was done on only 3 nodes: 2 with 2 procs and 1 with 1 proc.
> Not sure if this is a bug or a conflict in the configuration.
>
> My current additional configurations are:
>
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
>
> ENABLEMULTIREQJOBS    TRUE
> JOBNODEMATCHPOLICY    EXACTNODE
> NODEACCESSPOLICY      SINGLEJOB
>
> I also tried this, with the same result:
>
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
>
> ENABLEMULTIREQJOBS    TRUE
> NODEALLOCATIONPOLICY  PRIORITY
> NODECFG[DEFAULT]      PRIORITYF='APROCS'
> JOBNODEMATCHPOLICY    EXACTNODE
> NODEACCESSPOLICY      SINGLEJOB
>
> Any suggestions?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120601/3a751d45/attachment-0001.html

From gas5x at yahoo.com Sun Jun 3 07:47:16 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Sun, 3 Jun 2012 06:47:16 -0700 (PDT)
Subject: [torqueusers] cpusets in Torque 3.0, memory pressure
In-Reply-To:
Message-ID: <1338731236.31451.YahooMailClassic@web111301.mail.gq1.yahoo.com>

Dear David,

Thanks a lot for your answer!

What would be a good starting value for the memory pressure threshold, to configure it right?
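For orientation, here is a minimal sketch of how the two settings named in the changelog might be written out. It assumes the pbs_mom configuration file uses "$name value" lines and that both parameters take an integer. The threshold of 1000 is the starting value suggested in David's reply later in this thread; the duration of 5 is only a placeholder, not a recommendation. The helper name add_pressure_settings and the scratch file name are hypothetical.

```shell
#!/bin/sh
# Sketch: append memory-pressure settings to a pbs_mom config file.
# add_pressure_settings is a hypothetical helper; the file path is
# passed in so nothing is assumed about the local TORQUE_HOME.
add_pressure_settings() {
    config_file=$1
    threshold=$2   # e.g. 1000, the starting value suggested in this thread
    duration=$3    # placeholder value; tune for the site
    {
        printf '$memory_pressure_threshold %s\n' "$threshold"
        printf '$memory_pressure_duration %s\n' "$duration"
    } >> "$config_file"
}

# Example: write to a scratch file first and review it before
# merging it into the real mom config.
add_pressure_settings ./mom_config.sample 1000 5
cat ./mom_config.sample
```

Writing to a scratch file keeps the sketch harmless to run; pbs_mom would still need to re-read its configuration before any such change took effect.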
--
Grigory Shamov

--- On Fri, 6/1/12, David Beer wrote:

> From: David Beer
> Subject: Re: [torqueusers] cpusets in Torque 3.0, memory pressure
> To: "Torque Users Mailing List"
> Date: Friday, June 1, 2012, 1:07 PM
>
> On Fri, Jun 1, 2012 at 11:57 AM, Grigory Shamov wrote:
>
>> Hi,
>>
>> The Torque 3 changelog says "when cpusets are configured and memory
>> pressure enabled, add the ability to check memory pressure for a job. Using
>> $memory_pressure_threshold and $memory_pressure_duration in the mom's
>> config, the admin sets a threshold at which a job becomes a problem".
>>
>> Does it work for non-NUMA setup as well? I mean, if I configure it with
>> --enable-cpusets only, but --disable-numa-support ?
>
> Grigory,
>
> As long as you set your cpuset up correctly
> (http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/a.cmomconfig.php)
> it will work with or without numa support.
>
> David
>
>> --
>> Grigory Shamov
>> HPC Analyst,
>> University of Manitoba
>
> --
> David Beer | Software Engineer
> Adaptive Computing

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120603/646c7c12/attachment.html

From shantanugadgil at yahoo.com Mon Jun 4 04:40:01 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Mon, 4 Jun 2012 03:40:01 -0700 (PDT)
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
In-Reply-To: <20120601003856.GE26901@lbl.gov>
Message-ID: <1338806401.36170.YahooMailClassic@web120606.mail.ne1.yahoo.com>

--- On Fri, 6/1/12, Michael Jennings wrote:

> From: Michael Jennings
> > Yes, I use "./configure --prefix=/usr --enable-gui --with-tcl --with-tk"
> >
> > ('/usr' is just my preference, that's all)
>
> Can you send the output of the "rpmbuild" that fails? Or pastebin it
> somewhere and send the URL?

Hi,

Sorry for the confusion; I tried the process step by step on a clean setup of CentOS 6.2 x86_64 and faced no issues with the GUI packages... weird!

The default install didn't have the necessary devel packages, but that's all; no problems after that.

I am extremely confused about why the previous machine reported the problem and why the problem went away after the 32-bit package installation.

Anyway, all in all, on a clean setup the rpmbuild "just works", no problems!

Again, sorry about any confusion created. Something must be borked in my other machine's OS installation! :)

Regards,
Shantanu

>
> Thanks,
> Michael
>
> --
> Michael Jennings
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E        W: 510-495-2687
> MS 050B-3209
F: > 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From dbeer at adaptivecomputing.com Mon Jun 4 08:27:33 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 4 Jun 2012 08:27:33 -0600 Subject: [torqueusers] cpusets in Torque 3.0, memory pressure In-Reply-To: <1338731236.31451.YahooMailClassic@web111301.mail.gq1.yahoo.com> References: <1338731236.31451.YahooMailClassic@web111301.mail.gq1.yahoo.com> Message-ID: On Sun, Jun 3, 2012 at 7:47 AM, Grigory Shamov wrote: > Dear David, > > > Thanks a lot for your answer! > > > What would be a good starting value for the memory pressure threshold, to > configure it right? > I think 1000 is a good starting value. David > > -- > > Grigory Shamov > > > > > --- On *Fri, 6/1/12, David Beer * wrote: > > > From: David Beer > Subject: Re: [torqueusers] cpusets in Torque 3.0, memory pressure > To: "Torque Users Mailing List" > Date: Friday, June 1, 2012, 1:07 PM > > > > > On Fri, Jun 1, 2012 at 11:57 AM, Grigory Shamov > > wrote: > > Hi, > > The Torque 3 changelog says " when cpusets are configured and memory > pressure enabled, add the ability to check memory pressure for a job. Using > $memory_pressure_threshold and $memory_pressure_duration in the mom's > config, the admin sets a threshold at which a job becomes a problem". > > Does it work for non-NUMA setup as well? I mean, if I configure it with > --enable-cpusets only, but --disable-numa-support ? > > > Grigory, > > As long as you set your cpuset up correctly ( > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/a.cmomconfig.php) > it will work with or without numa support. 
> > David > > > -- > Grigory Shamov > HPC Analyst, > University of Manitoba > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > > -----Inline Attachment Follows----- > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120604/6c6dbb81/attachment.html From damianmontaldo at gmail.com Fri Jun 1 13:29:16 2012 From: damianmontaldo at gmail.com (Damian Montaldo) Date: Fri, 1 Jun 2012 16:29:16 -0300 Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option In-Reply-To: References: Message-ID: Hi, I need some help with Torque and a specific option of OpenMPI I'm sending this email twice because the firt one looks like it didn't arrive. I have nodes with 4 processors each and I want to launch only one process in each node using the pernode option because I need restrict that torque is not going to queue another jobs in that node. As the manual says: On each node, launch one process (-- equivalent to -npernode 1) This is the error I got. I try to google it but a segmentation fault it's a very common error and it's very common too to found it related to the binary (executed by mpiexec) and I think that this is a specific Torque error because running mpirun with the host file and the pernode it seems to work. 

$ cat TEST.e37495
[n52:04352] *** Process received signal ***
[n52:04352] Signal: Segmentation fault (11)
[n52:04352] Signal code: Address not mapped (1)
[n52:04352] Failing at address: 0x50
[n52:04352] [ 0] /lib/libpthread.so.0(+0xeff0) [0x2aca79ff4ff0]
[n52:04352] [ 1] /usr/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xbc) [0x2aca792c334c]
[n52:04352] [ 2] /usr/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x2d4) [0x2aca792d1ea4]
[n52:04352] [ 3] /usr/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x11e) [0x2aca792d596e]
[n52:04352] [ 4] /usr/lib/openmpi/lib/openmpi/mca_plm_tm.so(+0x1d4a) [0x2aca7b382d4a]
[n52:04352] [ 5] mpiexec() [0x403aaf]
[n52:04352] [ 6] mpiexec() [0x402f74]
[n52:04352] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x2aca7a220c8d]
[n52:04352] [ 8] mpiexec() [0x402e99]
[n52:04352] *** End of error message ***
/var/spool/torque/mom_priv/jobs/37495....SC: line 107:  4352 Segmentation fault      mpiexec -verbose -pernode -np $NP python ..args...
[n48:15977] [[10692,0],2] routed:binomial: Connection to lifeline [[10692,0],0] lost
[n49:15992] [[10692,0],1] routed:binomial: Connection to lifeline [[10692,0],0] lost
[n42:16290] [[10692,0],3] routed:binomial: Connection to lifeline [[10692,0],0] lost

$ qstat -f 37495
Job Id: 37495
    Job_Name = TEST
    resources_used.cput = 00:00:00
    resources_used.mem = 532kb
    resources_used.vmem = 9056kb
    resources_used.walltime = 00:00:01
    job_state = C
    queue = batch
    server = n0
    Checkpoint = u
    ctime = Thu May 31 20:42:47 2012
    exec_host = n52/3+n52/2+n52/1+n52/0+n49/3+n49/2+n49/1+n49/0+n48/3+n48/2+n4
        8/1+n48/0+n42/3+n42/2+n42/1+n42/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = eo
    Mail_Points = abe
    mtime = Thu May 31 20:43:21 2012
    Priority = 0
    qtime = Thu May 31 20:42:47 2012
    Rerunable = True
    Resource_List.nodect = 4
    Resource_List.nodes = 4:ppn=4
    Resource_List.walltime = 01:00:00
    session_id = 4342
    Variable_List = PBS_O_LANG=es_AR.UTF-8,
        PBS_O_LOGNAME=dfslezak,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
        PBS_O_SHELL=/bin/bash,PBS_SERVER=n0,
        PBS_O_QUEUE=batch,
        PBS_O_HOST=n0
    comment = Job started on Thu May 31 at 20:43
    etime = Thu May 31 20:42:47 2012
    exit_status = 0
    submit_args = -l walltime=1:00:00
    start_time = Thu May 31 20:43:21 2012
    Walltime.Remaining = 360
    start_count = 1
    fault_tolerant = False
    comp_time = Thu May 31 20:43:21 2012

$ mpiexec --version
mpiexec (OpenRTE) 1.4.2

It doesn't seem to be related to Python, but this is the version:
$ python --version
Python 2.6.6

It's a Debian Linux (squeeze, up to date) with this Torque version:
$ dpkg -l | grep torque
ii  libtorque2        2.4.8+dfsg-9squeeze1  shared library for Torque client and server
ii  torque-client     2.4.8+dfsg-9squeeze1  command line interface to Torque server
ii  torque-common     2.4.8+dfsg-9squeeze1  Torque Queueing System shared files
ii  torque-mom        2.4.8+dfsg-9squeeze1  job execution engine for Torque batch system
ii  torque-scheduler  2.4.8+dfsg-9squeeze1  scheduler part of Torque
ii  torque-server     2.4.8+dfsg-9squeeze1  PBS-derived batch processing server

If you need more specific info (perhaps a qmgr print server?) just tell me, and of course, any help would be very much appreciated!

From jwkeller at alaska.edu Mon Jun 4 00:32:10 2012
From: jwkeller at alaska.edu (John Keller)
Date: Sun, 3 Jun 2012 22:32:10 -0800
Subject: [torqueusers] Does Torque-2.5.11 require glibc v 2.14?
Message-ID:

I am trying to run pbs_mom on a node with openSUSE 11.4, which has glibc v 2.11. I get an error message:

pbs_mom: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by pbs_mom)
pbs_mom: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /usr/local/lib64/libtorque.so.2)

I read on the web that glibc v 2.14 is not compatible with openSUSE 11.4. sheesh!
Maybe I should go back to an earlier version of Torque. If so, which one do you recommend? John Keller From knielson at adaptivecomputing.com Mon Jun 4 09:10:38 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 4 Jun 2012 09:10:38 -0600 Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option In-Reply-To: References: Message-ID: What version of TORQUE? Ken On Thu, May 31, 2012 at 6:30 PM, Damian Montaldo wrote: > Hi, I need some help with Torque and a specific option of OpenMPI. > > I have nodes with 4 processors each and I want to launch only one > process in each node using the pernode option because I need restrict > that torque is not going to queue another jobs in that node. > As the manual says: On each node, launch one process (-- equivalent to > -npernode 1) > > This is the error I got. I try to google it but a segmentation fault > it's a very common error and it's very common too to found it related > to the binary (executed by mpiexec) and I think that this is a > specific Torque error because running mpirun with the host file and > the pernode it seems to work. 
> [...]
> $ python --version
> Python 2.6.6
>
> It a Debian Linux (squeeze up to date) with this Torque version
> $ dpkg -l | grep torque
> ii libtorque2 2.4.8+dfsg-9squeeze1 shared library for
> Torque client and server
> ii torque-client 2.4.8+dfsg-9squeeze1 command line
> interface to Torque server
> ii torque-common 2.4.8+dfsg-9squeeze1 Torque Queueing System shared
> files
> ii torque-mom 2.4.8+dfsg-9squeeze1 job execution engine for
> Torque batch system
> ii torque-scheduler 2.4.8+dfsg-9squeeze1 scheduler part of Torque
> ii torque-server 2.4.8+dfsg-9squeeze1 PBS-derived batch
> processing server
>
> If you need more specific info (perhaps a qmgr print server?) just
> tell, and of course, any help would be very appreciated!
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From damianmontaldo at gmail.com Mon Jun 4 09:23:27 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Mon, 4 Jun 2012 12:23:27 -0300
Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option
In-Reply-To:
References:
Message-ID:

On Mon, Jun 4, 2012 at 12:10 PM, Ken Nielson wrote:
> What version of TORQUE?

I'm using the stable version in Debian Squeeze
http://packages.debian.org/squeeze/torque-server
It says 2.4.8

Thanks.

> On Thu, May 31, 2012 at 6:30 PM, Damian Montaldo wrote:
>> It a Debian Linux (squeeze up to date) with this Torque version
>> $ dpkg -l | grep torque
>> ii  libtorque2        2.4.8+dfsg-9squeeze1  shared library for
>> Torque client and server
>> ii  torque-client     2.4.8+dfsg-9squeeze1  command line
>> interface to Torque server
>> ii  torque-common     2.4.8+dfsg-9squeeze1  Torque Queueing System shared
>> files
>> ii  torque-mom        2.4.8+dfsg-9squeeze1  job execution engine for
>> Torque batch system
>> ii  torque-scheduler  2.4.8+dfsg-9squeeze1  scheduler part of Torque
>> ii  torque-server     2.4.8+dfsg-9squeeze1  PBS-derived batch
>> processing server
>>
>> If you need more specific info (perhaps a qmgr print server?) just
>> tell, and of course, any help would be very appreciated!

From akohlmey at cmm.chem.upenn.edu Mon Jun 4 09:28:02 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Mon, 4 Jun 2012 11:28:02 -0400
Subject: [torqueusers] Does Torque-2.5.11 require glibc v 2.14?
In-Reply-To:
References:
Message-ID:

On Mon, Jun 4, 2012 at 2:32 AM, John Keller wrote:
> I am trying to run pbs_mom on a node with openSUSE 11.4, which has
> glibc v 2.11. I get an error message:
>
> pbs_mom: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by pbs_mom)
> pbs_mom: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by
> /usr/local/lib64/libtorque.so.2)
>
> I read on the web that glibc v 2.14 is not compatible with openSUSE
> 11.4. sheesh! Maybe I should go back to an earlier version of Torque.
> If so, which one do you recommend?

this has nothing to do with the version of torque,
but rather the version of glibc that was present
on the machine that this particular set of binaries
was compiled. if you want to use the same set of
torque binaries across a wide range of machines,
you have to compile them on the machine with
the "oldest" glibc installation. glibc is backward,
but not forward compatible.

axel.

> John Keller
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
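
The point made in the reply above (binaries built against a newer glibc will not load on a node with an older one) can be checked before copying binaries around. What follows is only a rough sketch, not anything shipped with Torque: the helper names max_glibc_required and version_le are made up for this note, and it assumes GNU grep and a sort that supports -V.

```shell
#!/bin/sh
# Sketch only: helper names are invented for this note, not part of Torque.

# Read text on stdin (for example the output of `objdump -T <binary>`, or the
# loader error quoted in this thread) and print the highest GLIBC_x.y
# symbol version that is mentioned.
max_glibc_required() {
    grep -oE 'GLIBC_[0-9]+(\.[0-9]+)*' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}

# Succeed when version $1 is lower than or equal to version $2.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example using the numbers from this thread: the binaries ask for 2.14,
# while the openSUSE 11.4 node provides 2.11.
needed=$(echo "version \`GLIBC_2.14' not found (required by pbs_mom)" | max_glibc_required)
if version_le "$needed" 2.11; then
    echo "ok: binaries need glibc $needed"
else
    echo "too new: binaries need glibc $needed, node has 2.11"
fi
```

On a real node one would presumably feed it something like `objdump -T "$(command -v pbs_mom)" | max_glibc_required` (assuming pbs_mom is on the PATH) and compare the result with the version reported by `getconf GNU_LIBC_VERSION`.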
From gus at ldeo.columbia.edu Mon Jun 4 09:48:30 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 04 Jun 2012 11:48:30 -0400 Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option In-Reply-To: References: Message-ID: <4FCCD8CE.6040309@ldeo.columbia.edu> Hi Damian Did you build your OpenMPI with Torque support? Or did you install it from a Debian package? The Debian OpenMPI package [if it exists] may not have Torque support. In this case, the mpiexec/mpirun probably won't know how to coordinate with Torque regarding nodes, cores, resources, etc. You can do 'mpicc --showme' to see if '-ltorque' appears there. I hope this helps, Gus Correa On 06/01/2012 03:29 PM, Damian Montaldo wrote: > Hi, I need some help with Torque and a specific option of OpenMPI > I'm sending this email twice because the firt one looks like it didn't arrive. > > I have nodes with 4 processors each and I want to launch only one > process in each node using the pernode option because I need restrict > that torque is not going to queue another jobs in that node. > As the manual says: On each node, launch one process (-- equivalent to > -npernode 1) > > This is the error I got. I try to google it but a segmentation fault > it's a very common error and it's very common too to found it related > to the binary (executed by mpiexec) and I think that this is a > specific Torque error because running mpirun with the host file and > the pernode it seems to work. 
> [...]
> $ python --version
> Python 2.6.6
>
> It a Debian Linux (squeeze up to date) with this Torque version
> $ dpkg -l | grep torque
> ii libtorque2 2.4.8+dfsg-9squeeze1 shared library for
> Torque client and server
> ii torque-client 2.4.8+dfsg-9squeeze1 command line
> interface to Torque server
> ii torque-common 2.4.8+dfsg-9squeeze1 Torque Queueing System shared files
> ii torque-mom 2.4.8+dfsg-9squeeze1 job execution engine for
> Torque batch system
> ii torque-scheduler 2.4.8+dfsg-9squeeze1 scheduler part of Torque
> ii torque-server 2.4.8+dfsg-9squeeze1 PBS-derived batch
> processing server
>
> If you need more specific info (perhaps a qmgr print server?) just
> tell, and of course, any help would be very appreciated!
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From damianmontaldo at gmail.com Mon Jun 4 09:58:48 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Mon, 4 Jun 2012 12:58:48 -0300
Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option
In-Reply-To: <4FCCD8CE.6040309@ldeo.columbia.edu>
References: <4FCCD8CE.6040309@ldeo.columbia.edu>
Message-ID:

On Mon, Jun 4, 2012 at 12:48 PM, Gus Correa wrote:
> Hi Damian
>
> Did you build your OpenMPI with Torque support?
> Or did you install it from a Debian package?

Hi Gus, I just installed the Debian packages of Torque and OpenMPI (both from the Debian repository).

> The Debian OpenMPI package [if it exists] may not
> have Torque support.
> In this case, the mpiexec/mpirun probably won't
> know how to coordinate with Torque regarding
> nodes, cores, resources, etc.
>
> You can do
> 'mpicc --showme'
> to see if
> '-ltorque'
> appears there.
>
> I hope this helps,
> Gus Correa

$ mpicc --showme
gcc -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

You're right, thanks a lot!

I'll try to build it from source instead of using the package from Debian.
If I can solve it, I'll post it here.

Thanks again.

From damianmontaldo at gmail.com Mon Jun 4 10:44:05 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Mon, 4 Jun 2012 13:44:05 -0300
Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option
In-Reply-To:
References: <4FCCD8CE.6040309@ldeo.columbia.edu>
Message-ID:

On Mon, Jun 4, 2012 at 12:58 PM, Damian Montaldo wrote:
> On Mon, Jun 4, 2012 at 12:48 PM, Gus Correa wrote:
>> Hi Damian
>>
>> Did you build your OpenMPI with Torque support?
>> Or did you install it from a Debian package?
>
> Hig Gus, I just install de Debian package of torque an openmpi (both
> form the Debian repository)
>
>> The Debian OpenMPI package [if it exists] may not
>> have Torque support.
>> In this case, the mpiexec/mpirun probably won't
>> know how to coordinate with Torque regarding
>> nodes, cores, resources, etc.
>>
>> You can do
>> 'mpicc --showme'
>> to see if
>> '-ltorque'
>> appears there.
>>
>> I hope this helps,
>> Gus Correa
>
> $ mpicc --showme
> gcc -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi
> -pthread -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lopen-pal -ldl
> -Wl,--export-dynamic -lnsl -lutil -lm -ldl
>
> You're right, thanks a lot!
>
> I'll try build it from source code instead of using the package from Debian.
> If I cloud solve it I'll post it here.

I was going to report this bug, but it had already been reported here:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=592887
Build with support for Torque (except on HURD).

I've installed that version (1.4.2-4), but it still seems to lack Torque support.
I'll continue this issue on the Debian-related lists or on the OpenMPI list.

Thanks.

From gus at ldeo.columbia.edu Mon Jun 4 11:09:14 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 04 Jun 2012 13:09:14 -0400
Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option
In-Reply-To:
References: <4FCCD8CE.6040309@ldeo.columbia.edu>
Message-ID: <4FCCEBBA.4070205@ldeo.columbia.edu>

Hi Damian

I am not a Debian user, but I would guess it is unlikely that the Debian packages will add Torque support.
This is because OpenMPI can support SGE, Slurm, and other resource managers, and you probably cannot add support for all of them at the same time.
Which resource manager one chooses is a matter of taste.

It may be easier to just uninstall the Debian OpenMPI packages and install OpenMPI from source, with Torque support, in a non-system directory [/usr/local/openmpi, /opt/openmpi-X.Y.Z, ...]

Something like this:

./configure --prefix=/a/non-system/directory --with-tm=/your/torque/directory ...
make
make install

Uninstalling the existing OpenMPI packages will save you headaches with inconsistent/duplicate/mixed binaries, libraries, paths, etc.

Then add the new OpenMPI directories to the appropriate environment variables [PATH and LD_LIBRARY_PATH] the way you prefer [say, via .bashrc, .tcshrc]

I hope this helps,
Gus Correa

On 06/04/2012 12:44 PM, Damian Montaldo wrote:
> On Mon, Jun 4, 2012 at 12:58 PM, Damian Montaldo wrote:
>> On Mon, Jun 4, 2012 at 12:48 PM, Gus Correa wrote:
>>> Hi Damian
>>>
>>> Did you build your OpenMPI with Torque support?
>>> Or did you install it from a Debian package?
>>
>> Hig Gus, I just install de Debian package of torque an openmpi (both
>> form the Debian repository)
>>
>>> The Debian OpenMPI package [if it exists] may not
>>> have Torque support.
>>> In this case, the mpiexec/mpirun probably won't
>>> know how to coordinate with Torque regarding
>>> nodes, cores, resources, etc.
>>> >>> You can do >>> 'mpicc --showme' >>> to see if >>> '-ltorque' >>> appears there. >>> >>> I hope this helps, >>> Gus Correa >> >> $ mpicc --showme >> gcc -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi >> -pthread -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lopen-pal -ldl >> -Wl,--export-dynamic -lnsl -lutil -lm -ldl >> >> You're right, thanks a lot! >> >> I'll try build it from source code instead of using the package from Debian. >> If I cloud solve it I'll post it here. > > I was looking forward to report this bug but it was reported before here > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=592887 > Build with support for Torque (except on HURD). > > I've installed that version (1.4.2-4) but it seems to lack of support. > I'll continue this issue in the debian related lists or in the OpemMPI list. > > Thanks. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jwkeller at alaska.edu Mon Jun 4 12:12:33 2012 From: jwkeller at alaska.edu (John Keller) Date: Mon, 4 Jun 2012 10:12:33 -0800 Subject: [torqueusers] Does Torque-2.5.11 require glibc v 2.14? In-Reply-To: References: Message-ID: <5225811295809270378@unknownmsgid> Thank you, Alex. I will try recompiling torque on an older machine. John K Sent from my iPad On Jun 4, 2012, at 7:28 AM, Axel Kohlmeyer wrote: > On Mon, Jun 4, 2012 at 2:32 AM, John Keller wrote: >> I am trying to run pbs_mom on a node with openSUSE 11.4, which has >> glibc v 2.11. I get an error message: >> >> pbs_mom: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by pbs_mom) >> pbs_mom: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by >> /usr/local/lib64/libtorque.so.2) >> >> I read on the web that glibc v 2.14 is not compatible with openSUSE >> 11.4. sheesh! Maybe I should go back to an earlier version of Torque. >> If so, which one do you recommend? 
>
> this has nothing to do with the version of torque,
> but rather the version of glibc that was present
> on the machine that this particular set of binaries
> was compiled. if you want to use the same set of
> torque binaries across a wide range of machines,
> you have to compile them on the machine with
> the "oldest" glibc installation. glibc is backward,
> but not forward compatible.
>
> axel.
>
>> John Keller
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> Dr. Axel Kohlmeyer akohlmey at gmail.com
> http://sites.google.com/site/akohlmey/
>
> Institute for Computational Molecular Science
> Temple University, Philadelphia PA, USA.
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From knielson at adaptivecomputing.com Wed Jun 6 13:55:47 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 6 Jun 2012 13:55:47 -0600
Subject: [torqueusers] Subversion URL changing for TORQUE
Message-ID:

Hi everyone,

We are changing the URL to check out code for TORQUE to svn://opensvn.adaptivecomputing.com/torque. The current URL svn://clusterresources.com/torque will still be available but will be removed soon. We will let you know a date as soon as we can.

In the meantime, please download source from svn://opensvn.adaptivecomputing.com/torque.

Thanks

Ken Nielson
Adaptive Computing

From knielson at adaptivecomputing.com Wed Jun 6 14:59:56 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 6 Jun 2012 14:59:56 -0600
Subject: [torqueusers] Update on source URL for TORQUE
Message-ID:

Hi All,

In a previous e-mail we announced that the new subversion URL to download TORQUE source will be svn://opensvn.adaptivecomputing.com/torque. The subversion URL svn://clusterresources.com/torque is still available but will be decommissioned on June 20, 2012.

Please start using the svn://opensvn.adaptivecomputing.com/torque URL and update any documentation you may have.

Regards

Ken Nielson
Adaptive Computing

From abhig at princeton.edu Wed Jun 6 15:43:25 2012
From: abhig at princeton.edu (Abhishek Gupta)
Date: Wed, 06 Jun 2012 17:43:25 -0400
Subject: [torqueusers] route_destination seems to be inefficient
Message-ID: <4FCFCEFD.9010702@princeton.edu>

Hi All,

I was trying to understand the route_destinations attribute. As per the documentation, the order of the queues matters for queue assignment. So, if a queue has no resources available at the moment, but its resources_max, resources_min, ACL etc. attributes match, the job will be assigned to that queue even if queues later in the order have both matching attributes and available resources.

I think it should first scan all the queues, see which ones match the attributes, and then assign the job to the matching queue whose nodes have the most resources available at that moment.

How can I configure my PBS to behave like this?

Thanks,
Abhishek Gupta.
Physics, Princeton University.
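
For readers unfamiliar with the setup being discussed, the ordered behaviour described in the message above corresponds to a routing queue defined roughly as follows. This is a minimal sketch under stated assumptions: the queue names routeq, short and long are hypothetical, and the helper route_queue_commands is invented for this note; it only prints qmgr commands and does not talk to a server. It illustrates the ordered list only and does not provide the resource-aware selection asked for.

```shell
#!/bin/sh
# Sketch only: queue names and the helper name are hypothetical.

# Print the qmgr commands that define a routing queue ($1) whose
# destinations (remaining arguments) are tried in the order given.
route_queue_commands() {
    _rq=$1
    shift
    echo "create queue $_rq"
    echo "set queue $_rq queue_type = Route"
    _op="="
    for _dest in "$@"; do
        # The first destination is assigned with "=", later ones are
        # appended with "+=", which preserves the order.
        echo "set queue $_rq route_destinations $_op $_dest"
        _op="+="
    done
    echo "set queue $_rq enabled = True"
    echo "set queue $_rq started = True"
}

# Example: jobs submitted to "routeq" are offered to "short" first, then "long".
route_queue_commands routeq short long
```

The printed commands could then be reviewed and piped into qmgr on the server host. The destination queues themselves (with their resources_max, resources_min and ACL settings) are assumed to exist already.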
From gas5x at yahoo.com Wed Jun 6 22:44:57 2012 From: gas5x at yahoo.com (Grigory Shamov) Date: Wed, 6 Jun 2012 21:44:57 -0700 (PDT) Subject: [torqueusers] cpusets in Torque 3.0, memory pressure In-Reply-To: Message-ID: <1339044297.54959.YahooMailClassic@web111312.mail.gq1.yahoo.com> Hi David, Thanks! ?I've tried to run some chemistry codes,? under Torque 3, with and without oversubscription for memory. It seems to work. The pressure? can quickly spike up to, say, 13000 with quite moderate swapping. If there are more than one jobs on the node, and one of them oversubscribes, would it create pressure for all the jobs there? -- Grigory Shamov --- On Mon, 6/4/12, David Beer wrote: From: David Beer Subject: Re: [torqueusers] cpusets in Torque 3.0, memory pressure To: "Torque Users Mailing List" Date: Monday, June 4, 2012, 7:27 AM On Sun, Jun 3, 2012 at 7:47 AM, Grigory Shamov wrote: Dear David, Thanks a lot for your ?answer!? What would be a good starting value for the memory pressure threshold, to configure it right?? I think 1000 is a good starting value. David ? --Grigory Shamov --- On Fri, 6/1/12, David Beer wrote: From: David Beer Subject: Re: [torqueusers] cpusets in Torque 3.0, memory pressure To: "Torque Users Mailing List" Date: Friday, June 1, 2012, 1:07 PM On Fri, Jun 1, 2012 at 11:57 AM, Grigory Shamov wrote: Hi, The Torque 3 changelog says " when cpusets are configured and memory pressure enabled, add the ability to check memory pressure for a job. Using $memory_pressure_threshold and $memory_pressure_duration in the mom's config, the admin sets a threshold at? which a job becomes a problem". Does it work for non-NUMA setup as well? I mean, if I configure it with --enable-cpusets only, but --disable-numa-support ? Grigory, As long as you set your cpuset up correctly (http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/a.cmomconfig.php) it will work with or without numa support. David? 
--
Grigory Shamov
HPC Analyst,
University of Manitoba

--
David Beer | Software Engineer
Adaptive Computing

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120606/b242bcd6/attachment.html

From h.schulz at hzdr.de Thu Jun 7 00:08:09 2012
From: h.schulz at hzdr.de (Dr. Henrik Schulz)
Date: Thu, 7 Jun 2012 08:08:09 +0200
Subject: [torqueusers] redirect job output
Message-ID: <3CCA7C65-F681-4A72-9679-1A5B5A609989@hzdr.de>

Dear all,

we are using Torque 3.0.3, and one of our users wants to redirect the output stream of a parallel job, or of an array of jobs, not into the .o and .e files all the time but, for some parts of his output, into one single file (containing the output of all MPI processes or of all jobs in the array). He therefore redirects the output streams using freopen(). Since he also wants to switch back to the .o and .e output files, he tries to set the streams back to the tty, but there we get a "no such device" error. If we submit an interactive job, the tty works. So the question is: what character device is used for the output stream of a PBS job?

Thanks a lot in advance!
Henrik Schulz From dbeer at adaptivecomputing.com Thu Jun 7 09:51:14 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 7 Jun 2012 09:51:14 -0600 Subject: [torqueusers] cpusets in Torque 3.0, memory pressure In-Reply-To: <1339044297.54959.YahooMailClassic@web111312.mail.gq1.yahoo.com> References: <1339044297.54959.YahooMailClassic@web111312.mail.gq1.yahoo.com> Message-ID: On Wed, Jun 6, 2012 at 10:44 PM, Grigory Shamov wrote: > Hi David, > > Thanks! > > I've tried to run some chemistry codes, under Torque 3, with and without > oversubscription for memory. It seems to work. The pressure can quickly > spike up to, say, 13000 with quite moderate swapping. > > If there are more than one jobs on the node, and one of them > oversubscribes, would it create pressure for all the jobs there? > I don't think that it would, but I would verify that experimentally. David > > -- > Grigory Shamov > > > > --- On *Mon, 6/4/12, David Beer * wrote: > > > From: David Beer > Subject: Re: [torqueusers] cpusets in Torque 3.0, memory pressure > To: "Torque Users Mailing List" > Date: Monday, June 4, 2012, 7:27 AM > > > > > On Sun, Jun 3, 2012 at 7:47 AM, Grigory Shamov > > wrote: > > Dear David, > > > Thanks a lot for your answer! > > > What would be a good starting value for the memory pressure threshold, to > configure it right? > > > I think 1000 is a good starting value. > > David > > > > -- > > Grigory Shamov > > > > > --- On *Fri, 6/1/12, David Beer > >* wrote: > > > From: David Beer > > > Subject: Re: [torqueusers] cpusets in Torque 3.0, memory pressure > To: "Torque Users Mailing List" > > > Date: Friday, June 1, 2012, 1:07 PM > > > > > On Fri, Jun 1, 2012 at 11:57 AM, Grigory Shamov > > wrote: > > Hi, > > The Torque 3 changelog says " when cpusets are configured and memory > pressure enabled, add the ability to check memory pressure for a job. 
Using > $memory_pressure_threshold and $memory_pressure_duration in the mom's > config, the admin sets a threshold at which a job becomes a problem". > > Does it work for non-NUMA setup as well? I mean, if I configure it with > --enable-cpusets only, but --disable-numa-support ? > > > Grigory, > > As long as you set your cpuset up correctly ( > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/a.cmomconfig.php) > it will work with or without numa support. > > David > > > -- > Grigory Shamov > HPC Analyst, > University of Manitoba > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > > -----Inline Attachment Follows----- > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > > -----Inline Attachment Follows----- > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/218b7388/attachment-0001.html From knielson at adaptivecomputing.com Thu Jun 7 10:00:04 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 7 Jun 2012 10:00:04 -0600 Subject: [torqueusers] Jobs limited to one per node In-Reply-To: References: Message-ID: On Tue, May 15, 2012 at 9:12 AM, Josh Nielsen wrote: > Hello, > > I noticed recently on our Torque cluster (3.0.2) that it is only allowing > one job per node and it is only assigning one CPU core for each job even > though there are eight per node (so it is not maxing out the resources - > and is wasting/not utilizing seven cores per node). After looking around > for a while I found a comment elsewhere on this mailing list about > compiling torque with the --enable-cpuset flag. I read the Torque page > about cpusets but am none the wiser about whether that is a required > feature to allow, what I would have thought to be default functionality of > allowing, more than one process/job to run on a node (and to utilize more > than one core per job). > Josh, npp is getting treated as a feature and you do not have that as a feature. What you really want is ppn. echo "sleep 60; echo test" | qsub -l nodes=1:ppn=1 This should fix your problem. Ken > > If I specify any npp=* value with qsub, even if only one (like echo "sleep > 60; echo test" | qsub -l nodes=1:npp=1), I get the message "qsub: Job > exceeds queue resource limits MSG=cannot locate feasible nodes". And during > the course of scheduling jobs, once there are more jobs requested than > there are nodes then they are listed as queued and in the sched_log/ log > files I see "Not enough of the right type of nodes available" for each new > request. I also tried adding np=8 to each of the nodes listed in > server_priv/nodes since I had not before, but it did not change anything. 
> > I will post my Torque config below, but I'm curious to know if > --enable-cpuset is what I need, since it is not made explicit that it is a > required flag to allow more than one job to run per node. Setting the > default and max settings was my attempt to get this working, although we > have another cluster that doesn't specify any of that and it runs as > expected by reserving the amount of cpus per node that you request with npp. > > qmgr -c "print server" > # > # Create queues and set their attributes. > # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_max.ncpus = 8 > set queue batch resources_max.nodes = 2 > set queue batch resources_min.ncpus = 1 > set queue batch resources_default.ncpus = 1 > set queue batch resources_default.nodect = 1 > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 32:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = penguin-head01.compute.haib.org > set server managers = root at penguin-head01.compute.haib.org > set server operators = root at penguin-head01.compute.haib.org > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server next_job_number = 554 > ------------------------ > > qmgr -c "list server" > Server penguin-head01.compute.haib.org > server_state = Active > scheduling = True > total_jobs = 0 > state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 > acl_hosts = penguin-head01.compute.haib.org > managers = root at penguin-head01.compute.haib.org > operators = root at penguin-head01.compute.haib.org > default_queue = batch > log_events = 511 > mail_from = adm > resources_assigned.ncpus = 0 > resources_assigned.nodect = 0 > scheduler_iteration = 600 > node_check_rate = 150 > tcp_timeout = 6 > mom_job_sync = True > pbs_version = 3.0.2 > keep_completed = 300 > next_job_number = 554 > net_counter = 2 0 0 > > > Any suggestions would be appreciated! > > Thanks, > Josh > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/de2a5efc/attachment.html From kunalgrao at gmail.com Thu Jun 7 10:05:12 2012 From: kunalgrao at gmail.com (Kunal Rao) Date: Thu, 7 Jun 2012 12:05:12 -0400 Subject: [torqueusers] Torque 3.0.3 and chroot environment Message-ID: Hi All, Has anyone tried chroot environment for Torque 3.0.3 or later version ? 
I'm thinking of having multiple chroot environments on the same system, each representing a compute node, and building a cluster from them.

So, even though there are, say, only 2 physical machines (1 server and 1 compute node), we should be able to make a cluster of, say, 4 nodes, assuming that the 1 physical compute node can host 3 chroot environments, each having its own virtual IP and communicating with the master as 3 independent compute nodes. The head node / server will see it as if there are 4 nodes, and the scheduler will allocate jobs accordingly.

Is this feasible, and can it work without any source code modifications to pbs_server / pbs_mom?

Thanks,
Kunal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/ec2672aa/attachment.html

From dbeer at adaptivecomputing.com Thu Jun 7 10:16:48 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 7 Jun 2012 10:16:48 -0600
Subject: [torqueusers] Torque 3.0.3 and chroot environment
In-Reply-To:
References:
Message-ID:

Kunal,

I have done a chroot environment with TORQUE - it worked fine. I was doing this for testing with sleep jobs, and the chroot was because I didn't want it to interact with anything else on the machine. I'm not sure what you're attempting to accomplish, but you may want to consider looking into the multi-mom feature (available starting in 3.0.0) that we also use a lot for testing. I have actually abandoned my chroot environment in favor of using the multi-moms.

http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.8multimom.php

David

On Thu, Jun 7, 2012 at 10:05 AM, Kunal Rao wrote:

> Hi All,
>
> Has anyone tried chroot environment for Torque 3.0.3 or later version ?
> I'm thinking of having multiple chroot environment on the same system, each
> representing a compute node and build a cluster.
> > So, even though there are say only 2 physical machines ( 1 server and 1 > compute node), we should be able to make a cluster of say 4 nodes. Assuming > that the 1 physical compute node can have 3 chroot environment, > each having its own virtual IP and communicating with the master as 3 > independent compute nodes. Head node / server will see as if there are 4 > nodes and the scheduler will aallocate jobs accordingly. > > Is this feasible and can work without any source code modifications to pbs > server / pbs mom ? > > Thanks, > Kunal > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/675bd88b/attachment.html From kunalgrao at gmail.com Thu Jun 7 10:50:34 2012 From: kunalgrao at gmail.com (Kunal Rao) Date: Thu, 7 Jun 2012 12:50:34 -0400 Subject: [torqueusers] Torque 3.0.3 and chroot environment In-Reply-To: References: Message-ID: Hi David, Thanks for your quick response and for pointing to the multi-mom feature. The idea is similar i.e. make a small cluster look bigger with being as realistic as possible. I read through that page and seems like it will do what I want. I had a follow up question on that : - Does each mom read from /proc and report to the head node (pbs_server) ? In that case the total cpus , memory, load etc. will be reported same from each of them. Can that be isolated and different for each of them to mimic actual large cluster i.e. each having different number of cpus, memory, load etc. Thanks, Kunal On Thu, Jun 7, 2012 at 12:16 PM, David Beer wrote: > Kunal, > > I have done a chroot environment with TORQUE - it worked fine. 
I was doing > this for testing with sleep jobs, and the chroot was because I didn't want > it to interact with anything else on the machine. I'm not sure what you're > attempting to accomplish, but you may want to consider looking into the > multi-mom feature (available starting in 3.0.0) that we also use a lot for > testing. I have actually abandoned my chroot environment in favor of using > the multi-moms. > > > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.8multimom.php > > David > > On Thu, Jun 7, 2012 at 10:05 AM, Kunal Rao wrote: > >> Hi All, >> >> Has anyone tried chroot environment for Torque 3.0.3 or later version ? >> I'm thinking of having multiple chroot environment on the same system, each >> representing a compute node and build a cluster. >> >> So, even though there are say only 2 physical machines ( 1 server and 1 >> compute node), we should be able to make a cluster of say 4 nodes. Assuming >> that the 1 physical compute node can have 3 chroot environment, >> each having its own virtual IP and communicating with the master as 3 >> independent compute nodes. Head node / server will see as if there are 4 >> nodes and the scheduler will aallocate jobs accordingly. >> >> Is this feasible and can work without any source code modifications to >> pbs server / pbs mom ? >> >> Thanks, >> Kunal >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > > > -- > David Beer | Software Engineer > Adaptive Computing > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/99d385b0/attachment-0001.html From dbeer at adaptivecomputing.com Thu Jun 7 11:08:36 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 7 Jun 2012 11:08:36 -0600 Subject: [torqueusers] Torque 3.0.3 and chroot environment In-Reply-To: References: Message-ID: Kunal, As of now each will report the same thing. If you wanted them to change each one, you'd have to modify the code. It wouldn't be too hard to do (the mom daemons know that they're running multi-mom) but it would take some customization. David On Thu, Jun 7, 2012 at 10:50 AM, Kunal Rao wrote: > Hi David, > > Thanks for your quick response and for pointing to the multi-mom feature. > The idea is similar i.e. make a small cluster look bigger with being as > realistic as possible. > > I read through that page and seems like it will do what I want. I had a > follow up question on that : > > - Does each mom read from /proc and report to the head node (pbs_server) ? > In that case the total cpus , memory, load etc. will be reported same from > each of them. Can that be isolated and different for each of them to mimic > actual large cluster i.e. each having different number of cpus, memory, > load etc. > > Thanks, > Kunal > > > > On Thu, Jun 7, 2012 at 12:16 PM, David Beer wrote: > >> Kunal, >> >> I have done a chroot environment with TORQUE - it worked fine. I was >> doing this for testing with sleep jobs, and the chroot was because I didn't >> want it to interact with anything else on the machine. I'm not sure what >> you're attempting to accomplish, but you may want to consider looking into >> the multi-mom feature (available starting in 3.0.0) that we also use a lot >> for testing. I have actually abandoned my chroot environment in favor of >> using the multi-moms. 
>> >> >> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.8multimom.php >> >> David >> >> On Thu, Jun 7, 2012 at 10:05 AM, Kunal Rao wrote: >> >>> Hi All, >>> >>> Has anyone tried chroot environment for Torque 3.0.3 or later version ? >>> I'm thinking of having multiple chroot environment on the same system, each >>> representing a compute node and build a cluster. >>> >>> So, even though there are say only 2 physical machines ( 1 server and 1 >>> compute node), we should be able to make a cluster of say 4 nodes. Assuming >>> that the 1 physical compute node can have 3 chroot environment, >>> each having its own virtual IP and communicating with the master as 3 >>> independent compute nodes. Head node / server will see as if there are 4 >>> nodes and the scheduler will aallocate jobs accordingly. >>> >>> Is this feasible and can work without any source code modifications to >>> pbs server / pbs mom ? >>> >>> Thanks, >>> Kunal >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> >> >> >> -- >> David Beer | Software Engineer >> Adaptive Computing >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/56a84bd2/attachment.html From kunalgrao at gmail.com Thu Jun 7 11:55:14 2012 From: kunalgrao at gmail.com (Kunal Rao) Date: Thu, 7 Jun 2012 13:55:14 -0400 Subject: [torqueusers] Torque 3.0.3 and chroot environment In-Reply-To: References: Message-ID: Hi David, Thanks for your response. This is exactly why I was thinking of a chroot kind of environment. Here each mom can have its own "/proc " directory and there could be an external script which populates various information for example loadavg, meminfo etc. and pbs_mom would read that and report to the head node. Thus there won't be any source code change at all and it would mimic a real large cluster. Were there any problems with the chroot environment settings that you had. If there is any documentation related to Torque configuration with chroot env. could you point me to that ? Coming back to the multi-mom feature, for each mom to report different load, memory etc. can we have : pbs_mom for node1 read from /proc/node1/ (we dump some loadavg , meminfo, cpuinfo etc. files here ) pbs_mom for node2 read from /proc/node2/ (here also we dump some loadavg, meminfo etc. files) ' ' I'm guessing for this, in the source code, wherever it is hard-coded to read from "/proc" path, it should take the path as an argument when pbs_mom is started. e.g. pbs_mom -u This will probably fix that issue and each mom will be able to report different resource information to the head node and will mimic a real large cluster with a smaller set of nodes. Let me know your thoughts on that. Are there other approaches that you have in mind ? Thanks, Kunal On Thu, Jun 7, 2012 at 1:08 PM, David Beer wrote: > Kunal, > > As of now each will report the same thing. If you wanted them to change > each one, you'd have to modify the code. It wouldn't be too hard to do (the > mom daemons know that they're running multi-mom) but it would take some > customization. 
> > David > > > On Thu, Jun 7, 2012 at 10:50 AM, Kunal Rao wrote: > >> Hi David, >> >> Thanks for your quick response and for pointing to the multi-mom feature. >> The idea is similar i.e. make a small cluster look bigger with being as >> realistic as possible. >> >> I read through that page and seems like it will do what I want. I had a >> follow up question on that : >> >> - Does each mom read from /proc and report to the head node (pbs_server) >> ? In that case the total cpus , memory, load etc. will be reported same >> from each of them. Can that be isolated and different for each of them to >> mimic >> actual large cluster i.e. each having different number of cpus, memory, >> load etc. >> >> Thanks, >> Kunal >> >> >> >> On Thu, Jun 7, 2012 at 12:16 PM, David Beer wrote: >> >>> Kunal, >>> >>> I have done a chroot environment with TORQUE - it worked fine. I was >>> doing this for testing with sleep jobs, and the chroot was because I didn't >>> want it to interact with anything else on the machine. I'm not sure what >>> you're attempting to accomplish, but you may want to consider looking into >>> the multi-mom feature (available starting in 3.0.0) that we also use a lot >>> for testing. I have actually abandoned my chroot environment in favor of >>> using the multi-moms. >>> >>> >>> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.8multimom.php >>> >>> David >>> >>> On Thu, Jun 7, 2012 at 10:05 AM, Kunal Rao wrote: >>> >>>> Hi All, >>>> >>>> Has anyone tried chroot environment for Torque 3.0.3 or later version ? >>>> I'm thinking of having multiple chroot environment on the same system, each >>>> representing a compute node and build a cluster. >>>> >>>> So, even though there are say only 2 physical machines ( 1 server and 1 >>>> compute node), we should be able to make a cluster of say 4 nodes. 
Assuming >>>> that the 1 physical compute node can have 3 chroot environment, >>>> each having its own virtual IP and communicating with the master as 3 >>>> independent compute nodes. Head node / server will see as if there are 4 >>>> nodes and the scheduler will aallocate jobs accordingly. >>>> >>>> Is this feasible and can work without any source code modifications to >>>> pbs server / pbs mom ? >>>> >>>> Thanks, >>>> Kunal >>>> >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>> >>>> >>> >>> >>> -- >>> David Beer | Software Engineer >>> Adaptive Computing >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > > > -- > David Beer | Software Engineer > Adaptive Computing > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/9d3abc6c/attachment.html From david.ceresuela at gmail.com Thu Jun 7 12:02:07 2012 From: david.ceresuela at gmail.com (David Ceresuela) Date: Thu, 7 Jun 2012 20:02:07 +0200 Subject: [torqueusers] pbs_sched does not run jobs unless forced with qrun Message-ID: Hi all, I have installed torque-4.0.2 in an Ubuntu-10.04 and I don't know how to make the scheduler run jobs. If I use qrun the job will be executed, but otherwise it won't. 
I have made sure that:
- trqauthd, pbs_server and pbs_sched are running on the head node
- pbs_mom is running on the compute node
- The user I submit the job as exists on both machines with the same id and the same group
- iptables is not interfering, and neither is any other firewall
- DNS is working properly, both forward and reverse. I am using a DNS server on another machine (neither head nor compute)
- pbsnodes shows

This is the setup:

head node
- hostname: lucid-tor1
- users: root, david
- executing: /etc/init.d/trqauthd, pbs_server, pbs_sched (the simplest scheduler)

compute node
- hostname: lucid-tor2
- users: root, david
- executing: pbs_mom

This is the output from qmgr -c 'p s':

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_max.nodes = 2
set queue batch resources_min.nodes = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
# set server scheduling = True set server acl_hosts = lucid-tor1.cps.cloud set server acl_hosts += lucid-tor1 set server managers = david at lucid-tor1 set server managers += root at lucid-tor1 set server operators = david at lucid-tor1 set server operators += root at lucid-tor1 set server default_queue = batch set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 300 set server job_stat_rate = 45 set server poll_jobs = True set server log_level = 3 set server mom_job_sync = True set server keep_completed = 10 set server next_job_number = 74 set server moab_array_compatible = True And this is the one from pbsnodes: lucid-tor2 state = free np = 1 ntype = cluster status = rectime=1339089060,varattr=,jobs=,state=free,netload=724555,gres=,loadave=0.00,ncpus=2,physmem=1023292kb,availmem=1458140kb,totmem=1521972kb,idletime=379,nusers=0,nsessions=0,uname=Linux lucid-tor2 2.6.32-33-server #70-Ubuntu SMP Thu Jul 7 22:28:30 UTC 2011 x86_64,opsys=linux mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 In the logs, I find no errors: ### sched log 06/07/2012 18:58:44;0002; pbs_sched;Svr;Log;Log opened 06/07/2012 18:58:44;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120607 opened 06/07/2012 18:58:44;0002; pbs_sched;Svr;main;pbs_sched startup pid 1203 ### server log 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type AuthenticateUser request received from david at lucid-tor1.cps.cloud, sock=9 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type QueueJob request received from david at lucid-tor1.cps.cloud, sock=8 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type JobScript request received from david at lucid-tor1.cps.cloud, sock=8 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type ReadyToCommit request received from david at lucid-tor1.cps.cloud, sock=8 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type Commit request received from david at lucid-tor1.cps.cloud, 
sock=8 06/07/2012 18:59:28;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from TRANSIT-TRANSICM to QUEUED-QUEUED (1-10) 06/07/2012 18:59:28;0100;PBS_Server;Job;73.lucid-tor1.cps.cloud;enqueuing into batch, state 1 hop 1 06/07/2012 18:59:28;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;Job Queued at request of david at lucid-tor1.cps.cloud, owner = david at lucid-tor1.cps.cloud, job name = STDIN, queue = batch 06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;message received from addr 155.210.155.xx7:219: mom_port 15002 - rm_port 15003 06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:219) (sock 10) 06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from lucid-tor2.cps.cloud 06/07/2012 18:59:45;0040;PBS_Server;Req;is_stat_get;received status from node lucid-tor2.cps.cloud 06/07/2012 18:59:45;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2 - state=512, newstate=0 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message received from addr 155.210.155.xx7:340: mom_port 15002 - rm_port 15003 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:340) (sock 8) 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from lucid-tor2.cps.cloud 06/07/2012 19:00:30;0040;PBS_Server;Req;is_stat_get;received status from node lucid-tor2.cps.cloud 06/07/2012 19:00:30;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2 - state=512, newstate=0 Then I run qrun: root at lucid-tor1: qrun 73 And these are the logs after qrun: ### sched log (it is the same) 06/07/2012 18:58:44;0002; pbs_sched;Svr;Log;Log opened 06/07/2012 18:58:44;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120607 opened 06/07/2012 18:58:44;0002; pbs_sched;Svr;main;pbs_sched 
startup pid 1203 ### server log ... 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message received from addr 155.210.155.xx7:340: mom_port 15002 - rm_port 15003 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:340) (sock 8) 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from lucid-tor2.cps.cloud 06/07/2012 19:00:30;0040;PBS_Server;Req;is_stat_get;received status from node lucid-tor2.cps.cloud 06/07/2012 19:00:30;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2 - state=512, newstate=0 06/07/2012 19:01:14;0004;PBS_Server;Svr;check_nodes_work;node lucid-tor2.cps.cloud not detected in 1339088474 seconds, marking node down 06/07/2012 19:01:14;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2.cps.cloud - state=512, newstate=2 06/07/2012 19:01:14;0040;PBS_Server;Req;update_node_state;node lucid-tor2.cps.cloud marked down 06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;message received from addr 155.210.155.xx7:279: mom_port 15002 - rm_port 15003 06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:279) (sock 9) 06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from lucid-tor2.cps.cloud 06/07/2012 19:01:15;0040;PBS_Server;Req;is_stat_get;received status from node lucid-tor2.cps.cloud 06/07/2012 19:01:15;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2 - state=512, newstate=0 06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;message received from addr 155.210.155.xx7:251: mom_port 15002 - rm_port 15003 06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:251) (sock 8) 06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from lucid-tor2.cps.cloud 06/07/2012 
19:02:00;0040;PBS_Server;Req;is_stat_get;received status from node lucid-tor2.cps.cloud 06/07/2012 19:02:00;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2 - state=512, newstate=0 06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;message received from addr 155.210.155.xx7:448: mom_port 15002 - rm_port 15003 06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:448) (sock 9) 06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from lucid-tor2.cps.cloud 06/07/2012 19:02:45;0040;PBS_Server;Req;is_stat_get;received status from node lucid-tor2.cps.cloud 06/07/2012 19:02:45;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2 - state=512, newstate=0 06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;message received from addr 155.210.155.xx7:650: mom_port 15002 - rm_port 15003 06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:650) (sock 8) 06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from lucid-tor2.cps.cloud 06/07/2012 19:03:30;0040;PBS_Server;Req;is_stat_get;received status from node lucid-tor2.cps.cloud 06/07/2012 19:03:30;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2 - state=512, newstate=0 06/07/2012 19:03:59;0100;PBS_Server;Req;;Type AuthenticateUser request received from root at lucid-tor1.cps.cloud, sock=8 06/07/2012 19:03:59;0100;PBS_Server;Req;;Type RunJob request received from root at lucid-tor1.cps.cloud, sock=9 06/07/2012 19:03:59;0040;PBS_Server;Req;set_nodes;allocating nodes for job 73.lucid-tor1.cps.cloud with node expression '1' 06/07/2012 19:03:59;0040;PBS_Server;Req;set_nodes;job 73.lucid-tor1.cps.cloud allocated 1 nodes (nodelist=lucid-tor2/0) 06/07/2012 19:03:59;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;Job Run at request of root at 
lucid-tor1.cps.cloud 06/07/2012 19:03:59;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from QUEUED-QUEUED to RUNNING-PRERUN (4-40) 06/07/2012 19:03:59;0002;PBS_Server;Job;73.lucid-tor1.cps.cloud;child reported success for job after 0 seconds (dest=???), rc=0 06/07/2012 19:03:59;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from RUNNING-TRNOUTCM to RUNNING-RUNNING (4-42) 06/07/2012 19:03:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 3 06/07/2012 19:04:03;0100;PBS_Server;Req;;Type StatusJob request received from pbs_mom at lucid-tor2.cps.cloud, sock=10 06/07/2012 19:04:03;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at lucid-tor2.cps.cloud, sock=8 06/07/2012 19:04:03;0009;PBS_Server;Job;73.lucid-tor1.cps.cloud;obit received - updating final job usage info 06/07/2012 19:04:03;0009;PBS_Server;Job;73.lucid-tor1.cps.cloud;job exit status 0 handled 06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from RUNNING-RUNNING to EXITING-EXITING (5-50) 06/07/2012 19:04:03;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;preparing to send 'e' mail for job 73.lucid-tor1.cps.cloud to david at lucid-tor1.cps.cloud (Exit_status=0) 06/07/2012 19:04:03;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Not sending email: User does not want mail of this type. 
06/07/2012 19:04:03;0010;PBS_Server;Job;73.lucid-tor1.cps.cloud;Exit_status=0 06/07/2012 19:04:03;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;on_job_exit valid pjob: 0x2493f30 (substate=50) 06/07/2012 19:04:03;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;JOB_SUBSTATE_EXITING 06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from EXITING-EXITING to EXITING-RETURNSTD (5-70) 06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from EXITING-RETURNSTD to EXITING-STAGEDEL (5-52) 06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Post job file processing error; job 73.lucid-tor1.cps.cloud on host lucid-tor2/0 06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;request to copy stageout files failed on node 'lucid-tor2/0' for job 73.lucid-tor1.cps.cloud 06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;preparing to send 'o' mail for job 73.lucid-tor1.cps.cloud to david at lucid-tor1.cps.cloud (request to copy stageout files failed on node 'lucid-tor2/0' for) 06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from EXITING-STAGEDEL to EXITING-EXITED (5-53) 06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from EXITING-EXITED to EXITING-ABORT (5-54) 06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Email 'o' to david at lucid-tor1.cps.cloud failed: Child process 'sendmail -f adm david at lucid-tor1.cps.cloud' returned 127 (errno 0:Success) 06/07/2012 19:04:12;0040;PBS_Server;Req;free_nodes;freeing nodes for job 73.lucid-tor1.cps.cloud 06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 73.lucid-tor1.cps.cloud state from EXITING-ABORT to COMPLETE-COMPLETE (6-59) 06/07/2012 19:04:15;0004;PBS_Server;Svr;svr_is_request;message received from addr 155.210.155.xx7:175: 
mom_port 15002 - rm_port 15003 06/07/2012 19:04:15;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:175) (sock 9) 06/07/2012 19:04:15;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from lucid-tor2.cps.cloud 06/07/2012 19:04:15;0040;PBS_Server;Req;is_stat_get;received status from node lucid-tor2.cps.cloud 06/07/2012 19:04:15;0040;PBS_Server;Req;update_node_state;adjusting state for node lucid-tor2 - state=0, newstate=0 06/07/2012 19:04:27;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;on_job_exit valid pjob: 0x2493f30 (substate=59) 06/07/2012 19:04:27;0100;PBS_Server;Job;73.lucid-tor1.cps.cloud;dequeuing from batch, state COMPLETE ### mom log 06/07/2012 18:59:00;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to lucid-tor1 06/07/2012 18:59:45;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to lucid-tor1 06/07/2012 19:00:30;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to lucid-tor1 06/07/2012 19:00:36;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 3 06/07/2012 19:01:15;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to lucid-tor1 06/07/2012 19:02:00;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to lucid-tor1 06/07/2012 19:02:45;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to lucid-tor1 06/07/2012 19:03:30;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to lucid-tor1 06/07/2012 19:03:59;0008; pbs_mom;Job;mom_process_request;request type QueueJob from host lucid-tor1.cps.cloud allowed 06/07/2012 19:03:59;0008; pbs_mom;Job;mom_process_request;request type JobScript from host lucid-tor1.cps.cloud allowed 06/07/2012 19:03:59;0008; pbs_mom;Job;mom_process_request;request type ReadyToCommit from host lucid-tor1.cps.cloud allowed 06/07/2012 19:03:59;0008; pbs_mom;Job;mom_process_request;request type Commit 
from host lucid-tor1.cps.cloud allowed 06/07/2012 19:03:59;0001; pbs_mom;Job;job_nodes;job: 73.lucid-tor1.cps.cloud numnodes=1 numvnod=1 06/07/2012 19:03:59;0001; pbs_mom;Job;73.lucid-tor1.cps.cloud;phase 2 of job launch successfully completed 06/07/2012 19:03:59;0001; pbs_mom;Job;TMomFinalizeJob3;Job 73.lucid-tor1.cps.cloud read start return code=0 session=959 06/07/2012 19:03:59;0001; pbs_mom;Job;TMomFinalizeJob3;job 73.lucid-tor1.cps.cloud started, pid = 959 06/07/2012 19:03:59;0001; pbs_mom;Job;73.lucid-tor1.cps.cloud;exec_job_on_ms:job successfully started 06/07/2012 19:04:03;0008; pbs_mom;Job;scan_for_terminated;pid 959 harvested for job 73.lucid-tor1.cps.cloud, task 1, exitcode=0 06/07/2012 19:04:03;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;scan_for_terminated: job 73.lucid-tor1.cps.cloud task 1 terminated, sid=959 06/07/2012 19:04:03;0080; pbs_mom;Svr;scan_for_exiting;searching for exiting jobs 06/07/2012 19:04:03;0008; pbs_mom;Job;kill_job;scan_for_exiting: sending signal 9, "KILL" to job 73.lucid-tor1.cps.cloud, reason: local task termination detected 06/07/2012 19:04:03;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 06/07/2012 19:04:03;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 06/07/2012 19:04:03;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 06/07/2012 19:04:03;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;performing job clean-up in preobit_reply() 06/07/2012 19:04:03;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;epilog subtask created with pid 962 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue 06/07/2012 19:04:03;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 73.lucid-tor1.cps.cloud 06/07/2012 19:04:03;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;obit sent to server 06/07/2012 19:04:03;0008; pbs_mom;Job;mom_process_request;request type CopyFiles from host lucid-tor1.cps.cloud allowed 06/07/2012 19:04:03;0008; pbs_mom;Job;73.lucid-tor1.cps.cloud;forking 
to user, uid: 1000 gid: 1000 homedir: '/home/david' 06/07/2012 19:04:12;0008; pbs_mom;Job;scan_for_terminated;pid 963 not tracked, statloc=0, exitval=0 06/07/2012 19:04:12;0008; pbs_mom;Job;mom_process_request;request type DeleteJob from host lucid-tor1.cps.cloud allowed 06/07/2012 19:04:12;0008; pbs_mom;Job;73.lucid-tor1.cps.cloud;deleting job 06/07/2012 19:04:12;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;deleting job 73.lucid-tor1.cps.cloud in state EXITED 06/07/2012 19:04:12;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;removed job script 06/07/2012 19:04:15;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to lucid-tor1 So the job gets executed on the compute node, but only if I force it with qrun. I have already found these questions - http://serverfault.com/questions/258195/torque-jobs-does-not-enter-e-state-unless-qrun - http://www.clusterresources.com/pipermail/torqueusers/2004-October/000871.html - http://www.supercluster.org/pipermail/torqueusers/2011-April/012609.html but the answers don't help that much. Did I miss anything? In case I didn't, do you have any idea of what could be going wrong? Thank you very much. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/2e778f02/attachment-0001.html From knielson at adaptivecomputing.com Thu Jun 7 14:01:28 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 7 Jun 2012 14:01:28 -0600 Subject: [torqueusers] Update on source URL for TORQUE In-Reply-To: References: Message-ID: Hi all, The svn://opensvn.adaptivecomputing.com/torque URL should now be working. Please let us know if you are still having problems. Ken Nielson Adaptive Computing On Wed, Jun 6, 2012 at 2:59 PM, Ken Nielson wrote: > Hi All, > > In a previous e-mail we announced that the new subversion URL to download > TORQUE source will be svn://opensvn.adaptivecomputing.com/torque. 
> > The subversion URL svn://clusterresources.com/torque is still available > but will be decommissioned on June 20, 2012. > > Please start using the svn://opensvn.adaptivecomputing.com/torque URL and > update any documentation you may have. > > Regards > > Ken Nielson > Adaptive Computing > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/ae50771c/attachment.html From ataufer at adaptivecomputing.com Thu Jun 7 14:11:11 2012 From: ataufer at adaptivecomputing.com (Al Taufer) Date: Thu, 7 Jun 2012 14:11:11 -0600 Subject: [torqueusers] pbs_sched does not run jobs unless forced with qrun In-Reply-To: References: Message-ID: The scheduler was broken in 4.0.2. It was fixed last month and should be working correctly in the torque-4.0.3-snap.201206071306.tar.gz snapshot that was released earlier today. On Thu, Jun 7, 2012 at 12:02 PM, David Ceresuela wrote: > Hi all, > > I have installed torque-4.0.2 on Ubuntu 10.04 and I don't know how to > make the scheduler run jobs. If I use qrun the job will be executed, but > otherwise it won't. > I have made sure that: > - trqauthd, pbs_server and pbs_sched are running on the head node > - pbs_mom is running on the compute node > - The user from which I submit the job exists on both machines with the same > id and the same group > - iptables is not interfering and neither is a firewall > - DNS is working properly, both forward and reverse. I am using a DNS > server on another machine (neither head nor compute) > - pbsnodes shows > > This is the setup: > > head node > - hostname: lucid-tor1 > - users: root, david > - executing: /etc/init.d/trqauthd, pbs_server, pbs_sched (the simplest > scheduler) > > compute node > - hostname: lucid-tor2 > - users: root, david > - executing: pbs_mom > > This is the output from qmgr -c 'p s': > # > # Create queues and set their attributes.
> # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_max.nodes = 2 > set queue batch resources_min.nodes = 1 > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = lucid-tor1.cps.cloud > set server acl_hosts += lucid-tor1 > set server managers = david at lucid-tor1 > set server managers += root at lucid-tor1 > set server operators = david at lucid-tor1 > set server operators += root at lucid-tor1 > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 300 > set server job_stat_rate = 45 > set server poll_jobs = True > set server log_level = 3 > set server mom_job_sync = True > set server keep_completed = 10 > set server next_job_number = 74 > set server moab_array_compatible = True > > And this is the one from pbsnodes: > lucid-tor2 > state = free > np = 1 > ntype = cluster > status = > rectime=1339089060,varattr=,jobs=,state=free,netload=724555,gres=,loadave=0.00,ncpus=2,physmem=1023292kb,availmem=1458140kb,totmem=1521972kb,idletime=379,nusers=0,nsessions=0,uname=Linux > lucid-tor2 2.6.32-33-server #70-Ubuntu SMP Thu Jul 7 22:28:30 UTC 2011 > x86_64,opsys=linux > mom_service_port = 15002 > mom_manager_port = 15003 >
gpus = 0 > > > In the logs, I find no errors: > > ### sched log > 06/07/2012 18:58:44;0002; pbs_sched;Svr;Log;Log opened > 06/07/2012 18:58:44;0002; pbs_sched;Svr;TokenAct;Account file > /var/spool/torque/sched_priv/accounting/20120607 opened > 06/07/2012 18:58:44;0002; pbs_sched;Svr;main;pbs_sched startup pid 1203 > > ### server log > 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type AuthenticateUser request > received from david at lucid-tor1.cps.cloud, sock=9 > 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type QueueJob request received from > david at lucid-tor1.cps.cloud, sock=8 > 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type JobScript request received > from david at lucid-tor1.cps.cloud, sock=8 > 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type ReadyToCommit request received > from david at lucid-tor1.cps.cloud, sock=8 > 06/07/2012 18:59:28;0100;PBS_Server;Req;;Type Commit request received from > david at lucid-tor1.cps.cloud, sock=8 > 06/07/2012 18:59:28;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from TRANSIT-TRANSICM to QUEUED-QUEUED > (1-10) > 06/07/2012 18:59:28;0100;PBS_Server;Job;73.lucid-tor1.cps.cloud;enqueuing > into batch, state 1 hop 1 > 06/07/2012 18:59:28;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;Job Queued > at request of david at lucid-tor1.cps.cloud, owner = > david at lucid-tor1.cps.cloud, job name = STDIN, queue = batch > 06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;message received from > addr 155.210.155.xx7:219: mom_port 15002 - rm_port 15003 > 06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) > received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:219) (sock > 10) > 06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received > from lucid-tor2.cps.cloud > 06/07/2012 18:59:45;0040;PBS_Server;Req;is_stat_get;received status from > node lucid-tor2.cps.cloud > 06/07/2012 18:59:45;0040;PBS_Server;Req;update_node_state;adjusting state > for node
lucid-tor2 - state=512, newstate=0 > 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message received from > addr 155.210.155.xx7:340: mom_port 15002 - rm_port 15003 > 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) > received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:340) (sock > 8) > 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received > from lucid-tor2.cps.cloud > 06/07/2012 19:00:30;0040;PBS_Server;Req;is_stat_get;received status from > node lucid-tor2.cps.cloud > 06/07/2012 19:00:30;0040;PBS_Server;Req;update_node_state;adjusting state > for node lucid-tor2 - state=512, newstate=0 > > Then I run qrun: > root at lucid-tor1: qrun 73 > > > And these are the logs after qrun: > > ### sched log (it is the same) > 06/07/2012 18:58:44;0002; pbs_sched;Svr;Log;Log opened > 06/07/2012 18:58:44;0002; pbs_sched;Svr;TokenAct;Account file > /var/spool/torque/sched_priv/accounting/20120607 opened > 06/07/2012 18:58:44;0002; pbs_sched;Svr;main;pbs_sched startup pid 1203 > > ### server log > ...
> 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message received from > addr 155.210.155.xx7:340: mom_port 15002 - rm_port 15003 > 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) > received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:340) (sock > 8) > 06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received > from lucid-tor2.cps.cloud > 06/07/2012 19:00:30;0040;PBS_Server;Req;is_stat_get;received status from > node lucid-tor2.cps.cloud > 06/07/2012 19:00:30;0040;PBS_Server;Req;update_node_state;adjusting state > for node lucid-tor2 - state=512, newstate=0 > 06/07/2012 19:01:14;0004;PBS_Server;Svr;check_nodes_work;node > lucid-tor2.cps.cloud not detected in 1339088474 seconds, marking node down > 06/07/2012 19:01:14;0040;PBS_Server;Req;update_node_state;adjusting state > for node lucid-tor2.cps.cloud - state=512, newstate=2 > 06/07/2012 19:01:14;0040;PBS_Server;Req;update_node_state;node > lucid-tor2.cps.cloud marked down > 06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;message received from > addr 155.210.155.xx7:279: mom_port 15002 - rm_port 15003 > 06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) > received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:279) (sock > 9) > 06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received > from lucid-tor2.cps.cloud > 06/07/2012 19:01:15;0040;PBS_Server;Req;is_stat_get;received status from > node lucid-tor2.cps.cloud > 06/07/2012 19:01:15;0040;PBS_Server;Req;update_node_state;adjusting state > for node lucid-tor2 - state=512, newstate=0 > 06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;message received from > addr 155.210.155.xx7:251: mom_port 15002 - rm_port 15003 > 06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) > received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:251) (sock > 8) > 06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received >
from lucid-tor2.cps.cloud > 06/07/2012 19:02:00;0040;PBS_Server;Req;is_stat_get;received status from > node lucid-tor2.cps.cloud > 06/07/2012 19:02:00;0040;PBS_Server;Req;update_node_state;adjusting state > for node lucid-tor2 - state=512, newstate=0 > 06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;message received from > addr 155.210.155.xx7:448: mom_port 15002 - rm_port 15003 > 06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) > received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:448) (sock > 9) > 06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received > from lucid-tor2.cps.cloud > 06/07/2012 19:02:45;0040;PBS_Server;Req;is_stat_get;received status from > node lucid-tor2.cps.cloud > 06/07/2012 19:02:45;0040;PBS_Server;Req;update_node_state;adjusting state > for node lucid-tor2 - state=512, newstate=0 > 06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;message received from > addr 155.210.155.xx7:650: mom_port 15002 - rm_port 15003 > 06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) > received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:650) (sock > 8) > 06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received > from lucid-tor2.cps.cloud > 06/07/2012 19:03:30;0040;PBS_Server;Req;is_stat_get;received status from > node lucid-tor2.cps.cloud > 06/07/2012 19:03:30;0040;PBS_Server;Req;update_node_state;adjusting state > for node lucid-tor2 - state=512, newstate=0 > 06/07/2012 19:03:59;0100;PBS_Server;Req;;Type AuthenticateUser request > received from root at lucid-tor1.cps.cloud, sock=8 > 06/07/2012 19:03:59;0100;PBS_Server;Req;;Type RunJob request received from > root at lucid-tor1.cps.cloud, sock=9 > 06/07/2012 19:03:59;0040;PBS_Server;Req;set_nodes;allocating nodes for job > 73.lucid-tor1.cps.cloud with node expression '1' > 06/07/2012 19:03:59;0040;PBS_Server;Req;set_nodes;job > 73.lucid-tor1.cps.cloud allocated 1 nodes (nodelist=lucid-tor2/0) >
06/07/2012 19:03:59;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;Job Run at > request of root at lucid-tor1.cps.cloud > 06/07/2012 19:03:59;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from QUEUED-QUEUED to RUNNING-PRERUN > (4-40) > 06/07/2012 19:03:59;0002;PBS_Server;Job;73.lucid-tor1.cps.cloud;child > reported success for job after 0 seconds (dest=???), rc=0 > 06/07/2012 19:03:59;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from RUNNING-TRNOUTCM to RUNNING-RUNNING > (4-42) > 06/07/2012 19:03:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = > 4.0.2, loglevel = 3 > 06/07/2012 19:04:03;0100;PBS_Server;Req;;Type StatusJob request received > from pbs_mom at lucid-tor2.cps.cloud, sock=10 > 06/07/2012 19:04:03;0100;PBS_Server;Req;;Type JobObituary request received > from pbs_mom at lucid-tor2.cps.cloud, sock=8 > 06/07/2012 19:04:03;0009;PBS_Server;Job;73.lucid-tor1.cps.cloud;obit > received - updating final job usage info > 06/07/2012 19:04:03;0009;PBS_Server;Job;73.lucid-tor1.cps.cloud;job exit > status 0 handled > 06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from RUNNING-RUNNING to EXITING-EXITING > (5-50) > 06/07/2012 19:04:03;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;preparing to > send 'e' mail for job 73.lucid-tor1.cps.cloud to david at lucid-tor1.cps.cloud > (Exit_status=0) > 06/07/2012 19:04:03;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Not sending > email: User does not want mail of this type. 
> 06/07/2012 > 19:04:03;0010;PBS_Server;Job;73.lucid-tor1.cps.cloud;Exit_status=0 > 06/07/2012 19:04:03;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;on_job_exit > valid pjob: 0x2493f30 (substate=50) > 06/07/2012 > 19:04:03;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;JOB_SUBSTATE_EXITING > 06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from EXITING-EXITING to EXITING-RETURNSTD > (5-70) > 06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from EXITING-RETURNSTD to EXITING-STAGEDEL > (5-52) > 06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Post job > file processing error; job 73.lucid-tor1.cps.cloud on host lucid-tor2/0 > 06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;request to > copy stageout files failed on node 'lucid-tor2/0' for job > 73.lucid-tor1.cps.cloud > 06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;preparing to > send 'o' mail for job 73.lucid-tor1.cps.cloud to david at lucid-tor1.cps.cloud > (request to copy stageout files failed on node 'lucid-tor2/0' for) > 06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from EXITING-STAGEDEL to EXITING-EXITED > (5-53) > 06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from EXITING-EXITED to EXITING-ABORT > (5-54) > 06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Email 'o' to > david at lucid-tor1.cps.cloud failed: Child process 'sendmail -f adm > david at lucid-tor1.cps.cloud' returned 127 (errno 0:Success) > 06/07/2012 19:04:12;0040;PBS_Server;Req;free_nodes;freeing nodes for job > 73.lucid-tor1.cps.cloud > 06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting > job 73.lucid-tor1.cps.cloud state from EXITING-ABORT to COMPLETE-COMPLETE > (6-59) > 06/07/2012 
19:04:15;0004;PBS_Server;Svr;svr_is_request;message received from > addr 155.210.155.xx7:175: mom_port 15002 - rm_port 15003 > 06/07/2012 19:04:15;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) > received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:175) (sock > 9) > 06/07/2012 19:04:15;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received > from lucid-tor2.cps.cloud > 06/07/2012 19:04:15;0040;PBS_Server;Req;is_stat_get;received status from > node lucid-tor2.cps.cloud > 06/07/2012 19:04:15;0040;PBS_Server;Req;update_node_state;adjusting state > for node lucid-tor2 - state=0, newstate=0 > 06/07/2012 19:04:27;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;on_job_exit > valid pjob: 0x2493f30 (substate=59) > 06/07/2012 19:04:27;0100;PBS_Server;Job;73.lucid-tor1.cps.cloud;dequeuing > from batch, state COMPLETE > > ### mom log > 06/07/2012 18:59:00;0002; pbs_mom;n/a;mom_server_update_stat;status update > successfully sent to lucid-tor1 > 06/07/2012 18:59:45;0002; pbs_mom;n/a;mom_server_update_stat;status update > successfully sent to lucid-tor1 > 06/07/2012 19:00:30;0002; pbs_mom;n/a;mom_server_update_stat;status update > successfully sent to lucid-tor1 > 06/07/2012 19:00:36;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, > loglevel = 3 > 06/07/2012 19:01:15;0002; pbs_mom;n/a;mom_server_update_stat;status update > successfully sent to lucid-tor1 > 06/07/2012 19:02:00;0002; pbs_mom;n/a;mom_server_update_stat;status update > successfully sent to lucid-tor1 > 06/07/2012 19:02:45;0002; pbs_mom;n/a;mom_server_update_stat;status update > successfully sent to lucid-tor1 > 06/07/2012 19:03:30;0002; pbs_mom;n/a;mom_server_update_stat;status update > successfully sent to lucid-tor1 > 06/07/2012 19:03:59;0008; pbs_mom;Job;mom_process_request;request type > QueueJob from host lucid-tor1.cps.cloud allowed > 06/07/2012 19:03:59;0008;
pbs_mom;Job;mom_process_request;request type > JobScript from host lucid-tor1.cps.cloud allowed > 06/07/2012 19:03:59;0008; pbs_mom;Job;mom_process_request;request type > ReadyToCommit from host lucid-tor1.cps.cloud allowed > 06/07/2012 19:03:59;0008; pbs_mom;Job;mom_process_request;request type > Commit from host lucid-tor1.cps.cloud allowed > 06/07/2012 19:03:59;0001; pbs_mom;Job;job_nodes;job: > 73.lucid-tor1.cps.cloud numnodes=1 numvnod=1 > 06/07/2012 19:03:59;0001; pbs_mom;Job;73.lucid-tor1.cps.cloud;phase 2 of > job launch successfully completed > 06/07/2012 19:03:59;0001; pbs_mom;Job;TMomFinalizeJob3;Job > 73.lucid-tor1.cps.cloud read start return code=0 session=959 > 06/07/2012 19:03:59;0001; pbs_mom;Job;TMomFinalizeJob3;job > 73.lucid-tor1.cps.cloud started, pid = 959 > 06/07/2012 19:03:59;0001; > pbs_mom;Job;73.lucid-tor1.cps.cloud;exec_job_on_ms:job successfully started > 06/07/2012 19:04:03;0008; pbs_mom;Job;scan_for_terminated;pid 959 > harvested for job 73.lucid-tor1.cps.cloud, task 1, exitcode=0 > 06/07/2012 19:04:03;0080; > pbs_mom;Job;73.lucid-tor1.cps.cloud;scan_for_terminated: job > 73.lucid-tor1.cps.cloud task 1 terminated, sid=959 > 06/07/2012 19:04:03;0080; pbs_mom;Svr;scan_for_exiting;searching for > exiting jobs > 06/07/2012 19:04:03;0008; pbs_mom;Job;kill_job;scan_for_exiting: sending > signal 9, "KILL" to job 73.lucid-tor1.cps.cloud, reason: local task > termination detected > 06/07/2012 19:04:03;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 06/07/2012 19:04:03;0080; > pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of > while loop > 06/07/2012 19:04:03;0080; pbs_mom;Svr;preobit_reply;in while loop, no > error from job stat > 06/07/2012 19:04:03;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;performing > job clean-up in preobit_reply() > 06/07/2012 19:04:03;0080;
pbs_mom;Job;73.lucid-tor1.cps.cloud;epilog > subtask created with pid 962 - substate set to JOB_SUBSTATE_OBIT - > registered post_epilogue > 06/07/2012 19:04:03;0080; pbs_mom;Req;post_epilogue;preparing obit message > for job 73.lucid-tor1.cps.cloud > 06/07/2012 19:04:03;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;obit sent to > server > 06/07/2012 19:04:03;0008; pbs_mom;Job;mom_process_request;request type > CopyFiles from host lucid-tor1.cps.cloud allowed > 06/07/2012 19:04:03;0008; pbs_mom;Job;73.lucid-tor1.cps.cloud;forking to > user, uid: 1000 gid: 1000 homedir: '/home/david' > 06/07/2012 19:04:12;0008; pbs_mom;Job;scan_for_terminated;pid 963 not > tracked, statloc=0, exitval=0 > 06/07/2012 19:04:12;0008; pbs_mom;Job;mom_process_request;request type > DeleteJob from host lucid-tor1.cps.cloud allowed > 06/07/2012 19:04:12;0008; pbs_mom;Job;73.lucid-tor1.cps.cloud;deleting job > 06/07/2012 19:04:12;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;deleting job > 73.lucid-tor1.cps.cloud in state EXITED > 06/07/2012 19:04:12;0080; pbs_mom;Job;73.lucid-tor1.cps.cloud;removed job > script > 06/07/2012 19:04:15;0002; pbs_mom;n/a;mom_server_update_stat;status update > successfully sent to lucid-tor1 > > So the job gets executed on the compute node, but only if I force it with > qrun. > I have already found these questions > - http://serverfault.com/questions/258195/torque-jobs-does-not-enter-e-state-unless-qrun > - http://www.clusterresources.com/pipermail/torqueusers/2004-October/000871.html > - http://www.supercluster.org/pipermail/torqueusers/2011-April/012609.html > but the answers don't help that much. > > Did I miss anything? > In case I didn't, do you have any idea of what could be going wrong? > > Thank you very much.
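For scanning longer server or mom logs like the ones quoted above, a minimal Python sketch may help. It assumes the semicolon-separated layout visible in those entries (timestamp; event code; daemon; record class; object name; message). The field labels and the helper names `parse_entry` and `node_down_events` are illustrative choices made here, not part of TORQUE.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    """One log record; the field names are labels assumed for this sketch."""
    timestamp: str
    event: str
    daemon: str
    kind: str
    name: str
    message: str


def parse_entry(line: str) -> Optional[LogEntry]:
    """Split one log line into six fields, or return None if it does not fit."""
    # Split at most five times so semicolons inside the message are preserved.
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    return LogEntry(*(p.strip() for p in parts))


def node_down_events(lines: Iterable[str]) -> List[LogEntry]:
    """Return the entries whose message reports a node being marked down."""
    entries = (parse_entry(line) for line in lines)
    return [e for e in entries if e is not None and "marking node down" in e.message]


# Two entries copied from the server log in this thread.
sample = [
    "06/07/2012 19:01:14;0004;PBS_Server;Svr;check_nodes_work;node "
    "lucid-tor2.cps.cloud not detected in 1339088474 seconds, marking node down",
    "06/07/2012 19:01:15;0040;PBS_Server;Req;is_stat_get;received status "
    "from node lucid-tor2.cps.cloud",
]

for entry in node_down_events(sample):
    print(entry.timestamp, entry.message)
```

Applied to the server log in this thread, a filter like this would single out the 19:01:14 `check_nodes_work` entry, where the node is marked down between two otherwise normal status updates.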
> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From knielson at adaptivecomputing.com Thu Jun 7 18:54:35 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 7 Jun 2012 18:54:35 -0600 Subject: [torqueusers] TORQUE 2.4.17 available for download Message-ID: Hi all, We are proud to announce the availability of TORQUE 2.4.17. 2.4.17 will be the last official release of the TORQUE 2.4-fixes branch. For Moab users, support from Adaptive Computing for 2.4.x will end on August 31, 2012. Following are the release notes for 2.4.17. New in 2.4.17 ------------- 2.4.17 will be the last release of the 2.4-fixes branch of TORQUE. The configuration option to build high availability changed from --with-high-availability to --enable-high-availability. A buffer overflow in tcp_puts was fixed; at times it caused too little memory to be allocated for outbound data. This may account for some segfaults and memory corruption problems. For a list of the remaining fixes, please see CHANGELOG. If you need this version of TORQUE, please download it and let us know if you have any problems. Regards Ken Nielson Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120607/fba5a39a/attachment-0001.html From roy.dragseth at cc.uit.no Fri Jun 8 08:26:59 2012 From: roy.dragseth at cc.uit.no (Roy Dragseth) Date: Fri, 08 Jun 2012 16:26:59 +0200 Subject: [torqueusers] cpusets in Torque 3.0, memory pressure In-Reply-To: <1339044297.54959.YahooMailClassic@web111312.mail.gq1.yahoo.com> References: <1339044297.54959.YahooMailClassic@web111312.mail.gq1.yahoo.com> Message-ID: <1536854.4BgBIPbsZS@newton.cc.uit.no> Last time I checked, Torque killed all jobs on a node when the memory pressure limit was exceeded.
See this thread for the details
http://www.clusterresources.com/pipermail/torqueusers/2011-November/013609.html

r.

On Wednesday, June 06, 2012 21:44:57 Grigory Shamov wrote:

Hi David,

Thanks! I've tried to run some chemistry codes, under Torque 3, with and without oversubscription for memory. It seems to work. The pressure can quickly spike up to, say, 13000 with quite moderate swapping.

If there are more than one jobs on the node, and one of them oversubscribes, would it create pressure for all the jobs there?

--
Grigory Shamov

--- On Mon, 6/4/12, David Beer wrote:

From: David Beer
Subject: Re: [torqueusers] cpusets in Torque 3.0, memory pressure
To: "Torque Users Mailing List"
Date: Monday, June 4, 2012, 7:27 AM

On Sun, Jun 3, 2012 at 7:47 AM, Grigory Shamov wrote:

Dear David,

Thanks a lot for your answer! What would be a good starting value for the memory pressure threshold, to configure it right?

I think 1000 is a good starting value.

David

--
Grigory Shamov

--- On Fri, 6/1/12, David Beer wrote:

From: David Beer
Subject: Re: [torqueusers] cpusets in Torque 3.0, memory pressure
To: "Torque Users Mailing List"
Date: Friday, June 1, 2012, 1:07 PM

On Fri, Jun 1, 2012 at 11:57 AM, Grigory Shamov wrote:

Hi,

The Torque 3 changelog says "when cpusets are configured and memory pressure enabled, add the ability to check memory pressure for a job. Using $memory_pressure_threshold and $memory_pressure_duration in the mom's config, the admin sets a threshold at which a job becomes a problem".

Does it work for non-NUMA setup as well? I mean, if I configure it with --enable-cpusets only, but --disable-numa-support ?

Grigory,

As long as you set your cpuset up correctly (http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/a.cmomconfig.php) it will work with or without numa support.
David

--
Grigory Shamov
HPC Analyst,
University of Manitoba

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

--
David Beer | Software Engineer
Adaptive Computing

-----Inline Attachment Follows-----

--
David Beer | Software Engineer
Adaptive Computing

-----Inline Attachment Follows-----

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120608/a1c4367f/attachment.html
From adaptivecomputing at bridgemailsystem.com Fri Jun 8 11:18:18 2012
From: adaptivecomputing at bridgemailsystem.com (Adaptive Computing)
Date: Fri, 8 Jun 2012 10:18:18 -0700 (PDT)
Subject: [torqueusers] Optimize Your HPC Environment for Big Data
Message-ID: <6217787.1339175900482.JavaMail.root@mail4.bridgemailsystem.com>

An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120608/fe5cb432/attachment-0001.html From david.ceresuela at gmail.com Fri Jun 8 16:34:03 2012 From: david.ceresuela at gmail.com (David Ceresuela) Date: Sat, 9 Jun 2012 00:34:03 +0200 Subject: [torqueusers] pbs_sched does not run jobs unless forced with qrun Message-ID: That solved my problem. Thank you very much, Al. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120609/33f1dc53/attachment.html From kunalgrao at gmail.com Mon Jun 11 15:16:59 2012 From: kunalgrao at gmail.com (Kunal Rao) Date: Mon, 11 Jun 2012 17:16:59 -0400 Subject: [torqueusers] Torque 3.0.3 and chroot environment In-Reply-To: References: Message-ID: Any further thoughts / inputs on this ? Thanks, Kunal On Thu, Jun 7, 2012 at 1:55 PM, Kunal Rao wrote: > Hi David, > > Thanks for your response. > > This is exactly why I was thinking of a chroot kind of environment. Here > each mom can have its own "/proc " directory and there could be an external > script which populates various information for example loadavg, meminfo > etc. and pbs_mom would read that and report to the head node. Thus there > won't be any source code change at all and it would mimic a real large > cluster. > > Were there any problems with the chroot environment settings that you had. > If there is any documentation related to Torque configuration with chroot > env. could you point me to that ? > > Coming back to the multi-mom feature, for each mom to report different > load, memory etc. can we have : > > pbs_mom for node1 read from /proc/node1/ (we dump some loadavg , > meminfo, cpuinfo etc. files here ) > > pbs_mom for node2 read from /proc/node2/ (here also we dump some > loadavg, meminfo etc. 
files) > > ' > ' > I'm guessing for this, in the source code, wherever it is hard-coded to > read from "/proc" path, it should take the path as an argument when pbs_mom > is started. > > e.g. pbs_mom -u > > This will probably fix that issue and each mom will be able to report > different resource information to the head node and will mimic a real large > cluster with a smaller set of nodes. > > Let me know your thoughts on that. Are there other approaches that you > have in mind ? > > Thanks, > Kunal > > > On Thu, Jun 7, 2012 at 1:08 PM, David Beer wrote: > >> Kunal, >> >> As of now each will report the same thing. If you wanted them to change >> each one, you'd have to modify the code. It wouldn't be too hard to do (the >> mom daemons know that they're running multi-mom) but it would take some >> customization. >> >> David >> >> >> On Thu, Jun 7, 2012 at 10:50 AM, Kunal Rao wrote: >> >>> Hi David, >>> >>> Thanks for your quick response and for pointing to the multi-mom >>> feature. The idea is similar i.e. make a small cluster look bigger with >>> being as realistic as possible. >>> >>> I read through that page and seems like it will do what I want. I had a >>> follow up question on that : >>> >>> - Does each mom read from /proc and report to the head node (pbs_server) >>> ? In that case the total cpus , memory, load etc. will be reported same >>> from each of them. Can that be isolated and different for each of them to >>> mimic >>> actual large cluster i.e. each having different number of cpus, >>> memory, load etc. >>> >>> Thanks, >>> Kunal >>> >>> >>> >>> On Thu, Jun 7, 2012 at 12:16 PM, David Beer >> > wrote: >>> >>>> Kunal, >>>> >>>> I have done a chroot environment with TORQUE - it worked fine. I was >>>> doing this for testing with sleep jobs, and the chroot was because I didn't >>>> want it to interact with anything else on the machine. 
I'm not sure what
>>>> you're attempting to accomplish, but you may want to consider looking into
>>>> the multi-mom feature (available starting in 3.0.0) that we also use a lot
>>>> for testing. I have actually abandoned my chroot environment in favor of
>>>> using the multi-moms.
>>>>
>>>> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.8multimom.php
>>>>
>>>> David
>>>>
>>>> On Thu, Jun 7, 2012 at 10:05 AM, Kunal Rao wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Has anyone tried a chroot environment with Torque 3.0.3 or a later version?
>>>>> I'm thinking of having multiple chroot environments on the same system,
>>>>> each representing a compute node, and building a cluster from them.
>>>>>
>>>>> So, even though there are, say, only 2 physical machines (1 server and
>>>>> 1 compute node), we should be able to make a cluster of, say, 4 nodes,
>>>>> assuming that the 1 physical compute node can have 3 chroot environments,
>>>>> each having its own virtual IP and communicating with the master as 3
>>>>> independent compute nodes. The head node / server will see it as if there
>>>>> are 4 nodes and the scheduler will allocate jobs accordingly.
>>>>>
>>>>> Is this feasible, and can it work without any source code modifications to
>>>>> pbs server / pbs mom ?
>>>>> >>>>> Thanks, >>>>> Kunal >>>>> >>>>> _______________________________________________ >>>>> torqueusers mailing list >>>>> torqueusers at supercluster.org >>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>> >>>>> >>>> >>>> >>>> -- >>>> David Beer | Software Engineer >>>> Adaptive Computing >>>> >>>> >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>> >>>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> >> >> >> -- >> David Beer | Software Engineer >> Adaptive Computing >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120611/7eb93fd8/attachment.html From knielson at adaptivecomputing.com Tue Jun 12 13:26:47 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 12 Jun 2012 13:26:47 -0600 Subject: [torqueusers] TORQUE 2.5.12 available Message-ID: Hi all, TORQUE 2.5.12 is available for general use. The tar ball can be downloaded from http://www.adaptivecomputing.com/support/download-center/torque-download/ New in 2.5.12 Enabled MOMs to change a users group on stderr/stdout files. This fixes a problem where files owned by the user but not its group caused a job to fail. Fixed a large memory leak on the mom when configured with --enable-nvidia-gpus. A buffer overflow problem was fixed in tcp_puts which at times made it so not enough memory would be allocated for outbound data. This may account for some segfaults and memory corruption problems. 
add the mom config option exec_with_exec. When set to true, the pbs_mom will exec the job script instead of piping it to stdin. This makes signal trapping easier because the shell doesn't have to be configured to trap the signal as well. For all bugs fixed in 2.5.12 please see the CHANGELOG. Regards Ken Nielson Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120612/0f8a863f/attachment.html From chris.evert at geokinetics.com Tue Jun 12 14:36:49 2012 From: chris.evert at geokinetics.com (Chris Evert) Date: Tue, 12 Jun 2012 15:36:49 -0500 Subject: [torqueusers] qsub -l file= Message-ID: <4FD7A861.1040203@geokinetics.com> (Torque|Maui)Users, We have threaded applications, so when we submit jobs, we use qsub -l nodes=1:ppn=16 etc. But now, we want to ensure enough local disk space for processing (200G, some nodes have more space than others). When we try qsub -l nodes=1:ppn=16,file=200gb etc. maui won't schedule the job, because we have no nodes with 200gb x 16 disk space. When we try qsub -l nodes=1:ppn=16,file=12gb etc. the job is scheduled (we have nodes with >200G disk space free), but when it runs and writes a local file larger than 12gb, the job is killed for exceeding ulimit file size. How can we submit a single job that uses 16 processors and 200G of local disk space when there is 250G of local disk space available on some nodes with 16 cpus? torque 2.3.10 maui 3.3 Thanks, Chris -- Chris Evert Geokinetics, Inc. Houston, TX From castellacci.franco at gmail.com Tue Jun 12 10:33:19 2012 From: castellacci.franco at gmail.com (Franco Castellacci) Date: Tue, 12 Jun 2012 13:33:19 -0300 Subject: [torqueusers] Node is in state down, normal configuration Message-ID: Hello, I recently add a node to our cluster, and follow my usual procedure. Im using dnsmasq to handle DNS redirection so the nodes are identified by their names. 
This is what I found in mom_logs:

"pbs_mom;Svr;pbs_mom;LOG_ERROR::pbs_mom, Unable to get my full hostname for nodo-31 error -1"

I don't get it. I tried doing a reverse DNS lookup, and both the node and the frontend recognize each other by name and by IP address. I can ssh between them as well.

I'll appreciate any help.

From delphine.ramalingom at univ-reunion.fr Wed Jun 13 00:25:27 2012
From: delphine.ramalingom at univ-reunion.fr (Delphine Ramalingom)
Date: Wed, 13 Jun 2012 10:25:27 +0400
Subject: [torqueusers] pbsdsh
Message-ID: <4FD83257.4040706@univ-reunion.fr>

Hi,

I have installed torque 4.0.0 on a workstation with 32 cores.
I tried to use pbsdsh and want to submit parallel jobs without using MPI.

Is there documentation that explains how to do that?

I have just created a script for torque:

#!/bin/sh
#PBS -N pbsdsh
#PBS -l select=1:ppn=8:mem=500mb
#PBS -l walltime=03:00:00
#PBS -q oneday
#PBS -j oe
#PBS -V
cd $PBS_O_WORKDIR
# Run the other script
/usr/local/bin/pbsdsh -v $PBS_O_WORKDIR/mypbsdsh

My program mypbsdsh reads files whose names carry a number that corresponds to PBS_TASKNUM:

file.$PBS_TASKNUM

But $PBS_TASKNUM is always equal to 2 and only one file is read. Why?

I also get this error:

pbsdsh: spawned task 0
pbsdsh: spawn event returned: 0 (1 spawns and 0 obits outstanding)
pbsdsh: sending obit for task 2
pbsdsh: Event poll failed, error TM_ENOTCONNECTED
pbsdsh: reconnected
pbsdsh: sending obit for task 2
pbsdsh: obit event returned: 0 (0 spawns and 1 obits outstanding)
pbsdsh: task 0 exit status 0

Can someone help me, please?

Regards,
Delphine

From adaptivecomputing at bridgemailsystem.com Wed Jun 13 07:12:10 2012
From: adaptivecomputing at bridgemailsystem.com (Adaptive Computing)
Date: Wed, 13 Jun 2012 06:12:10 -0700 (PDT)
Subject: [torqueusers] Optimize Your HPC Environment for Big Data
Message-ID: <156856.1339593133290.JavaMail.root@mail2.bms.local>

An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120613/2c70406b/attachment.html From fowler at hep.brown.edu Wed Jun 13 09:34:46 2012 From: fowler at hep.brown.edu (Jack Fowler) Date: Wed, 13 Jun 2012 11:34:46 -0400 Subject: [torqueusers] upgrading 2.3.6 and /dev/null issue In-Reply-To: <3b4d1134-0d57-4918-8b1e-8b7138eaecda@mail> References: <3b4d1134-0d57-4918-8b1e-8b7138eaecda@mail> Message-ID: <4FD8B316.1010704@hep.brown.edu> Hi All, We have the /dev/null bug as described here: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=61. We have a default torque-2.3.6 installation on an old Scientific Linux 4.5 system. I'm thinking of upgrading to torque-2.5.12. Note that upgrading to a newer SL isn't possible at this time. Will this fix the /dev/null issue? Is this the correct version? Can I use the installation procedure in the Admin Manual or are there additional issues/gotchas I need to look at? thanks, Jack From dbeer at adaptivecomputing.com Wed Jun 13 10:09:22 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 13 Jun 2012 10:09:22 -0600 Subject: [torqueusers] pbsdsh In-Reply-To: <4FD83257.4040706@univ-reunion.fr> References: <4FD83257.4040706@univ-reunion.fr> Message-ID: Delphine, This is an issue that is fixed in subsequent releases of 4.0.0. Please download 4.0.2: http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gzand the problem will be resolved. David On Wed, Jun 13, 2012 at 12:25 AM, Delphine Ramalingom < delphine.ramalingom at univ-reunion.fr> wrote: > Hi, > > I have installed torque 4.0.0 on a workstation with 32 cores. > I tried to use pbsdsh and want to submit parallel jobs without using mpi. > > Is there a documentation that explains how do that ? 
> > I have just create un script for torque : > > #!/bin/sh > #PBS -N pbsdsh > #PBS -l select=1:ppn=8:mem=500mb > #PBS -l walltime=03:00:00 > #PBS -q oneday > #PBS -j oe > #PBS -V > cd $PBS_O_WORKDIR > # Run the other script > /usr/local/bin/pbsdsh -v $PBS_O_WORKDIR/mypbsdsh > > And my program mypbsdsh read some files with a number that correspnds to > PBS_TASKNUM > file.$PBS_TASKNUM > > But $PBS_TASKNUM always equals to 2 and only one file is reading. Why ? > > I've also this error : > pbsdsh: spawned task 0 > pbsdsh: spawn event returned: 0 (1 spawns and 0 obits outstanding) > pbsdsh: sending obit for task 2 > pbsdsh: Event poll failed, error TM_ENOTCONNECTED > pbsdsh: reconnected > pbsdsh: sending obit for task 2 > pbsdsh: obit event returned: 0 (0 spawns and 1 obits outstanding) > pbsdsh: task 0 exit status 0 > > Can someone help me, please ? > > Regards, > Delphine > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120613/d8b3e0e7/attachment.html From dbeer at adaptivecomputing.com Wed Jun 13 10:12:37 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 13 Jun 2012 10:12:37 -0600 Subject: [torqueusers] upgrading 2.3.6 and /dev/null issue In-Reply-To: <4FD8B316.1010704@hep.brown.edu> References: <3b4d1134-0d57-4918-8b1e-8b7138eaecda@mail> <4FD8B316.1010704@hep.brown.edu> Message-ID: According to TORQUE's CHANGELOG, this was fixed in 2.5.2. David On Wed, Jun 13, 2012 at 9:34 AM, Jack Fowler wrote: > Hi All, > We have the /dev/null bug as described here: > http://www.clusterresources.com/bugzilla/show_bug.cgi?id=61. We have a > default torque-2.3.6 > installation on an old Scientific Linux 4.5 system. 
I'm thinking of
> upgrading to torque-2.5.12. Note
> that upgrading to a newer SL isn't possible at this time.
> Will this fix the /dev/null issue? Is this the correct version? Can I
> use the installation
> procedure in the Admin Manual or are there additional issues/gotchas I
> need to look at?
>
> thanks,
> Jack
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120613/68f52c9d/attachment-0001.html
From gus at ldeo.columbia.edu Wed Jun 13 15:34:21 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 13 Jun 2012 17:34:21 -0400
Subject: [torqueusers] upgrading 2.3.6 and /dev/null issue
In-Reply-To:
References: <3b4d1134-0d57-4918-8b1e-8b7138eaecda@mail> <4FD8B316.1010704@hep.brown.edu>
Message-ID: <4FD9075D.5020002@ldeo.columbia.edu>

Hi Jack

This thread may help. It has various interesting insights on
how to upgrade Torque:

http://www.clusterresources.com/pipermail/torqueusers/2008-April/007224.html

Besides, in another thread that I cannot find now,
somebody whom I should thank and give credit if
I could find the thread, gave this excellent suggestion
of saving your current server configuration with:

qmgr -c 'p s' > current_torque_setup

then reading it back after the new server is installed
(qmgr reads its directives from standard input, so no -c here):

qmgr < current_torque_setup

I hope this helps,
Gus Correa

On 06/13/2012 12:12 PM, David Beer wrote:
> According to TORQUE's CHANGELOG, this was fixed in 2.5.2.
>
> David
>
> On Wed, Jun 13, 2012 at 9:34 AM, Jack Fowler
> wrote:
>
> Hi All,
> We have the /dev/null bug as described here:
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=61. We have
> a default torque-2.3.6
> installation on an old Scientific Linux 4.5 system.
I'm thinking of > upgrading to torque-2.5.12. Note > that upgrading to a newer SL isn't possible at this time. > Will this fix the /dev/null issue? Is this the correct version? > Can I use the installation > procedure in the Admin Manual or are there additional issues/gotchas > I need to look at? > > thanks, > Jack > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From ianm at uchicago.edu Wed Jun 13 21:41:59 2012 From: ianm at uchicago.edu (Ian Miller) Date: Thu, 14 Jun 2012 03:41:59 +0000 Subject: [torqueusers] PBS_Server just stop responding Message-ID: <843FE493E7B6CA42A6C4682D63AE2D95071748@XM-MBX-02-PROD.ad.uchicago.edu> Hi All, I have a 34 node cluster running CentOS 6 with torque 2.5.7 and maui 3.3.1 When a user submits a job to a node and it takes up pretty much all of the resources on the server I've noticed that qsub and qstat will stop responding. My fix is to restart the pbs_server. My question Is this a config on the mom side that needs to be changed or is this a pbs_server end config that needs to be looked at. Users will submit jobs that from time to time will kill a node but the rest of the cluster should not suffer. -i Ian Miller Systems Administrator Ecology & Evolution Organismal Biology and Anatomy University of Chicago From nt_mahmood at yahoo.com Thu Jun 14 05:53:16 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Thu, 14 Jun 2012 04:53:16 -0700 (PDT) Subject: [torqueusers] cannot locate feasible nodes Message-ID: <1339674796.63040.YahooMailNeo@web111705.mail.gq1.yahoo.com> Dear all, I faced the problem before (6 month ago), however I don't remember the solution. 
There are also discussions about this error however I didn't find a clear solution. Here is the problem: When I run qsub command I get this error:

qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes

Assumptions:
1)
mahmood at srv:~$ pbsnodes -l all
srv                  free
ws01                 free
ws02                 free
ws03                 free
ws04                 free
ws05                 free

2)
mahmood at srv:~$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

     0 Active Jobs       0 of  166 Processors Active (0.00%)
                         0 of    6 Nodes Active      (0.00%)

3)
mahmood at srv:~$ cat /etc/hosts
127.0.0.1       localhost.localdomain localhost
192.168.1.100 hpclab.srv srv
192.168.1.1 hpclab.ws01 ws01
192.168.1.2 hpclab.ws02 ws02
192.168.1.3 hpclab.ws03 ws03
192.168.1.4 hpclab.ws04 ws04
192.168.1.5 hpclab.ws05 ws05

4) passwordless ssh works fine:
mahmood at srv:~$ ssh ws01
Last login: Thu Jun 14 19:29:50 2012 from hpclab.srv
mahmood at ws01:~$

5)
root at srv:~# cat /var/spool/pbs/server_priv/nodes
srv np=14
ws01 np=24
ws02 np=32
ws03 np=32
ws04 np=32
ws05 np=32

6)
root at srv:~# cat /var/spool/pbs/server_name
hpclab.srv

7)
root at srv:~# qmgr
Max open servers: 4
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue srvq
#
create queue srvq
set queue srvq queue_type = Execution
set queue srvq Priority = 10
set queue srvq resources_default.nodes = 1
set queue srvq resources_default.walltime = 960:00:00
set queue srvq enabled = True
set queue srvq started = True
#
# Create and define queue medium
#
create queue medium
set queue medium queue_type = Execution
set queue medium Priority = 80
set queue medium resources_default.nodes = 1
set queue medium resources_default.walltime = 05:00:00
set queue medium enabled = True
set queue medium started = True
#
# Create and define queue small
#
create queue small
set queue small queue_type = Execution
set queue small Priority = 100
set queue small resources_default.nodes = 1
set queue small resources_default.walltime = 02:00:00
set queue small enabled = True
set queue small started = True
#
# Create and define queue very_small
#
create queue very_small
set queue very_small queue_type = Execution
set queue very_small Priority = 120
set queue very_small resources_default.nodes = 1
set queue very_small resources_default.walltime = 01:00:00
set queue very_small enabled = True
set queue very_small started = True
#
# Create and define queue big
#
create queue big
set queue big queue_type = Execution
set queue big Priority = 60
set queue big resources_default.nodes = 1
set queue big resources_default.walltime = 10:00:00
set queue big enabled = True
set queue big started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = hpclab.srv
set server acl_hosts += srv
set server default_queue = very_big
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 166
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 59522
set server server_name = hpclab.srv

I will be very thankful for any hint/help.
Regards, // Naderan *Mahmood; From knielson at adaptivecomputing.com Thu Jun 14 08:30:40 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 14 Jun 2012 08:30:40 -0600 Subject: [torqueusers] PBS_Server just stop responding In-Reply-To: <843FE493E7B6CA42A6C4682D63AE2D95071748@XM-MBX-02-PROD.ad.uchicago.edu> References: <843FE493E7B6CA42A6C4682D63AE2D95071748@XM-MBX-02-PROD.ad.uchicago.edu> Message-ID: On Wed, Jun 13, 2012 at 9:41 PM, Ian Miller wrote: > Hi All, > I have a 34 node cluster running CentOS 6 with torque 2.5.7 and maui 3.3.1 > When a user submits a job to a node and it takes up pretty much all of the > resources on the server I've noticed that qsub and qstat will stop > responding. My fix is to restart the pbs_server. My question Is this a > config on the mom side that needs to be changed or is this a pbs_server end > config that needs to be looked at. Users will submit jobs that from time > to time will kill a node but the rest of the cluster should not suffer. > > -i > > What else is happening on your system. For example, how many jobs are in the queue? Do you have a user calling qstat over and over? This combination on 2.5 can cause the server to appear hung because it is single threaded and all the time is getting taken up by the qstat calls. I would look at other things along this line as well. Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120614/f5cd1b87/attachment.html From knielson at adaptivecomputing.com Thu Jun 14 08:35:25 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 14 Jun 2012 08:35:25 -0600 Subject: [torqueusers] cannot locate feasible nodes In-Reply-To: <1339674796.63040.YahooMailNeo@web111705.mail.gq1.yahoo.com> References: <1339674796.63040.YahooMailNeo@web111705.mail.gq1.yahoo.com> Message-ID: On Thu, Jun 14, 2012 at 5:53 AM, Mahmood Naderan wrote: > Dear all, > I faced the problem before (6 month ago), however I don't remember the > solution. There are also discussions about this error however I didn't find > a clear solution. Here is the problem: When I run qsub command I get this > error: > > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes > > Assumptions: > 1) > > mahmood at srv:~$ pbsnodes -l all > srv free > ws01 free > ws02 free > ws03 free > ws04 free > ws05 free > > > 2) > > mahmood at srv:~$ showq > ACTIVE JOBS-------------------- > JOBNAME USERNAME STATE PROC REMAINING > STARTTIME > > 0 Active Jobs 0 of 166 Processors Active (0.00%) > 0 of 6 Nodes Active (0.00%) > > > 3) > mahmood at srv:~$ cat /etc/hosts > 127.0.0.1 localhost.localdomain localhost > 192.168.1.100 hpclab.srv srv > 192.168.1.1 hpclab.ws01 ws01 > 192.168.1.2 hpclab.ws02 ws02 > 192.168.1.3 hpclab.ws03 ws03 > 192.168.1.4 hpclab.ws04 ws04 > 192.168.1.5 hpclab.ws05 ws05 > > > 4) passwordless ssh works fine: > mahmood at srv:~$ ssh ws01 > Last login: Thu Jun 14 19:29:50 2012 from hpclab.srv > mahmood at ws01:~$ > > > 5) > root at srv:~# cat /var/spool/pbs/server_priv/nodes > srv np=14 > ws01 np=24 > ws02 np=32 > ws03 np=32 > ws04 np=32 > ws05 np=32 > > > 6) > root at srv:~# cat /var/spool/pbs/server_name > hpclab.srv > > 7) > root at srv:~# qmgr > Max open servers: 4 > Qmgr: print server > # > # Create queues and set their attributes. 
> # > # > # Create and define queue srvq > # > create queue srvq > set queue srvq queue_type = Execution > set queue srvq Priority = 10 > set queue srvq resources_default.nodes = 1 > set queue srvq resources_default.walltime = 960:00:00 > set queue srvq enabled = True > set queue srvq started = True > # > # Create and define queue medium > # > create queue medium > set queue medium queue_type = Execution > set queue medium Priority = 80 > set queue medium resources_default.nodes = 1 > set queue medium resources_default.walltime = 05:00:00 > set queue medium enabled = True > set queue medium started = True > # > # Create and define queue small > # > create queue small > set queue small queue_type = Execution > set queue small Priority = 100 > set queue small resources_default.nodes = 1 > set queue small resources_default.walltime = 02:00:00 > set queue small enabled = True > set queue small started = True > # > # Create and define queue very_small > # > create queue very_small > set queue very_small queue_type = Execution > set queue very_small Priority = 120 > set queue very_small resources_default.nodes = 1 > set queue very_small resources_default.walltime = 01:00:00 > set queue very_small enabled = True > set queue very_small started = True > # > # Create and define queue big > # > create queue big > set queue big queue_type = Execution > set queue big Priority = 60 > set queue big resources_default.nodes = 1 > set queue big resources_default.walltime = 10:00:00 > set queue big enabled = True > set queue big started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = hpclab.srv > set server acl_hosts += srv > set server default_queue = very_big > set server log_events = 511 > set server mail_from = adm > set server resources_available.nodect = 166 > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server next_job_number = 59522 > set server server_name = hpclab.srv > > > > I will be very thankful for any hint/help. > Regards, > > // Naderan *Mahmood; > What version of TORQUE and what is the entire qsub line? Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120614/f831e742/attachment-0001.html From fowler at hep.brown.edu Thu Jun 14 08:39:17 2012 From: fowler at hep.brown.edu (Jack Fowler) Date: Thu, 14 Jun 2012 10:39:17 -0400 Subject: [torqueusers] upgrading 2.3.6 and /dev/null issue In-Reply-To: <4FD9075D.5020002@ldeo.columbia.edu> References: <3b4d1134-0d57-4918-8b1e-8b7138eaecda@mail> <4FD8B316.1010704@hep.brown.edu> <4FD9075D.5020002@ldeo.columbia.edu> Message-ID: <4FD9F795.2030703@hep.brown.edu> Hi Gus, Yes, this helps a lot, thanks! cheers, Jack On 06/13/2012 05:34 PM, Gus Correa wrote: > Hi Jack > > This thread may help. It has various interesting insights on > how to upgrade Torque: > > http://www.clusterresources.com/pipermail/torqueusers/2008-April/007224.html > > Besides, in another thread that I cannot find now, > somebody whom I should thank and give credit if > I could find the thread, gave this excellent suggestion > of saving your current server configuration with: > > qmgr -c 'p s' > current_torque_setup > > then reading it back after the new server is installed, > > qmgr -c < current_torque_setup > > I hope this helps, > Gus Correa > > On 06/13/2012 12:12 PM, David Beer wrote: >> According to TORQUE's CHANGELOG, this was fixed in 2.5.2. 
>> >> David >> >> On Wed, Jun 13, 2012 at 9:34 AM, Jack Fowler > > wrote: >> >> Hi All, >> We have the /dev/null bug as described here: >> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=61. We have >> a default torque-2.3.6 >> installation on an old Scientific Linux 4.5 system. I'm thinking of >> upgrading to torque-2.5.12. Note >> that upgrading to a newer SL isn't possible at this time. >> Will this fix the /dev/null issue? Is this the correct version? >> Can I use the installation >> procedure in the Admin Manual or are there additional issues/gotchas >> I need to look at? >> >> thanks, >> Jack >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> >> -- >> David Beer | Software Engineer >> Adaptive Computing >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From nt_mahmood at yahoo.com Thu Jun 14 09:18:48 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Thu, 14 Jun 2012 08:18:48 -0700 (PDT) Subject: [torqueusers] cannot locate feasible nodes In-Reply-To: References: <1339674796.63040.YahooMailNeo@web111705.mail.gq1.yahoo.com> Message-ID: <1339687128.80400.YahooMailNeo@web111702.mail.gq1.yahoo.com> Sorry, our assistant made a mistake in his qsub script. 
The problem was a typo: "#PBS -l nodes=sw01" instead of "#PBS -l nodes=ws01". Thanks for your help. // Naderan *Mahmood; ________________________________ From: Ken Nielson To: Mahmood Naderan ; Torque Users Mailing List Sent: Thursday, June 14, 2012 7:05 PM Subject: Re: [torqueusers] cannot locate feasible nodes On Thu, Jun 14, 2012 at 5:53 AM, Mahmood Naderan wrote: Dear all, >I faced this problem before (6 months ago), but I don't remember the solution. There are also discussions about this error, but I didn't find a clear solution. Here is the problem: When I run the qsub command I get this error: > >qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes > >Assumptions: >1) > >mahmood at srv:~$ pbsnodes -l all >srv                  free >ws01                 free >ws02                 free >ws03                 free >ws04                 free >ws05                 free > > >2) > >mahmood at srv:~$ showq >ACTIVE JOBS-------------------- >JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME > >     0 Active Jobs       0 of  166 Processors Active (0.00%) >                         0 of    6 Nodes Active      (0.00%) > > >3) >mahmood at srv:~$ cat /etc/hosts >127.0.0.1       localhost.localdomain localhost >192.168.1.100 hpclab.srv srv >192.168.1.1 hpclab.ws01 ws01 >192.168.1.2 hpclab.ws02 ws02 >192.168.1.3 hpclab.ws03 ws03 >192.168.1.4 hpclab.ws04 ws04 >192.168.1.5 hpclab.ws05 ws05 > > >4) passwordless ssh works fine: >mahmood at srv:~$ ssh ws01 >Last login: Thu Jun 14 19:29:50 2012 from hpclab.srv >mahmood at ws01:~$ > > >5) >root at srv:~# cat /var/spool/pbs/server_priv/nodes >srv np=14 >ws01 np=24 >ws02 np=32 >ws03 np=32 >ws04 np=32 >ws05 np=32 > > >6) >root at srv:~# cat /var/spool/pbs/server_name >hpclab.srv > >7) >root at srv:~# qmgr >Max open servers: 4 >Qmgr: print server ># ># Create queues and set their attributes.
># ># ># Create and define queue srvq ># >create queue srvq >set queue srvq queue_type = Execution >set queue srvq Priority = 10 >set queue srvq resources_default.nodes = 1 >set queue srvq resources_default.walltime = 960:00:00 >set queue srvq enabled = True >set queue srvq started = True ># ># Create and define queue medium ># >create queue medium >set queue medium queue_type = Execution >set queue medium Priority = 80 >set queue medium resources_default.nodes = 1 >set queue medium resources_default.walltime = 05:00:00 >set queue medium enabled = True >set queue medium started = True ># ># Create and define queue small ># >create queue small >set queue small queue_type = Execution >set queue small Priority = 100 >set queue small resources_default.nodes = 1 >set queue small resources_default.walltime = 02:00:00 >set queue small enabled = True >set queue small started = True ># ># Create and define queue very_small ># >create queue very_small >set queue very_small queue_type = Execution >set queue very_small Priority = 120 >set queue very_small resources_default.nodes = 1 >set queue very_small resources_default.walltime = 01:00:00 >set queue very_small enabled = True >set queue very_small started = True ># ># Create and define queue big ># >create queue big >set queue big queue_type = Execution >set queue big Priority = 60 >set queue big resources_default.nodes = 1 >set queue big resources_default.walltime = 10:00:00 >set queue big enabled = True >set queue big started = True ># ># Set server attributes. ># >set server scheduling = True >set server acl_hosts = hpclab.srv >set server acl_hosts += srv >set server default_queue = very_big >set server log_events = 511 >set server mail_from = adm >set server resources_available.nodect = 166 >set server scheduler_iteration = 600 >set server node_check_rate = 150 >set server tcp_timeout = 6 >set server next_job_number = 59522 >set server server_name = hpclab.srv > > > >I will be very thankful for any hint/help. 
>Regards, > >// Naderan *Mahmood; > What version of TORQUE and what is the entire qsub line? Ken From atf3 at psu.edu Thu Jun 14 09:18:26 2012 From: atf3 at psu.edu (Alex Frase) Date: Thu, 14 Jun 2012 11:18:26 -0400 Subject: [torqueusers] square brackets in PBS_JOBID and TMPDIR Message-ID: <4FDA00C2.5030707@psu.edu> Hello, We've run into problems with TORQUE's use of square brackets to denote array jobs. This behavior appears to have started in 2.5 and remains as of 4.0.1. Because PBS_JOBID (i.e. "1234[1]") is also used as part of TMPDIR (i.e. "/tmp/1234[1].hostname/"), the square brackets indirectly end up in user scripted shell commands such as: cp $TMPDIR/output* /my/dir/ Such a command can fail for an array job if the []s in TMPDIR confuse the shell, preventing the expansion of the * wildcard; rather than copying all files as expected, cp then reports that there is no such file "output*". A similar problem was posted to this list in 2010: http://www.clusterresources.com/pipermail/torqueusers/2010-July/011114.html However, since that report only raised the issue in the context of the stdout/stderr output filenames, it appears the solution was to simply filter []s out of those filenames only. A similar strategy could work here in the construction of the TMPDIR string, but since users may still want to use PBS_JOBID in a shell context for their own reasons, it might be safer to address both of these issues at their source and avoid using any characters in PBS_JOBID which have special meaning to any of the common shells. Thanks, Alex From dbeer at adaptivecomputing.com Thu Jun 14 09:56:07 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 14 Jun 2012 09:56:07 -0600 Subject: [torqueusers] PBS_Server just stop responding In-Reply-To: References: <843FE493E7B6CA42A6C4682D63AE2D95071748@XM-MBX-02-PROD.ad.uchicago.edu> Message-ID: How did you configure TORQUE? Did you use --with-tcp-retry-limit=? I suggest using 5 there. 
pbs_server can get stuck retrying different ports for a very long time (over 4.5 hours) because it will retry about 880 different ports to contact a certain node, and sometimes it gets stuck. If you set this limit, you make it so that it doesn't retry more than the number of times that you specify. David On Thu, Jun 14, 2012 at 8:30 AM, Ken Nielson wrote: > On Wed, Jun 13, 2012 at 9:41 PM, Ian Miller wrote: > >> Hi All, >> I have a 34 node cluster running CentOS 6 with torque 2.5.7 and maui 3.3.1 >> When a user submits a job to a node and it takes up pretty much all of >> the resources on the server I've noticed that qsub and qstat will stop >> responding. My fix is to restart the pbs_server. My question Is this a >> config on the mom side that needs to be changed or is this a pbs_server end >> config that needs to be looked at. Users will submit jobs that from time >> to time will kill a node but the rest of the cluster should not suffer. >> >> -i >> >> > What else is happening on your system. For example, how many jobs are in > the queue? Do you have a user calling qstat over and over? This combination > on 2.5 can cause the server to appear hung because it is single threaded > and all the time is getting taken up by the qstat calls. > > I would look at other things along this line as well. > > Ken > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120614/248ed48b/attachment.html From nt_mahmood at yahoo.com Thu Jun 14 11:53:39 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Thu, 14 Jun 2012 10:53:39 -0700 (PDT) Subject: [torqueusers] load balancing Message-ID: <1339696419.11961.YahooMailNeo@web111712.mail.gq1.yahoo.com> Hi Assume there are two nodes: w1 np=12 w2 np=12 When I submit 8 jobs, all 8 jobs are run on w2. It is more desirable to have two 50% load on each node rather than a 100% load on one. The 8 jobs are all single threaded not MPI. Is there any way to define such policy? ?Thanks // Naderan *Mahmood; From bircoph at gmail.com Thu Jun 14 13:29:26 2012 From: bircoph at gmail.com (Andrew Savchenko) Date: Thu, 14 Jun 2012 23:29:26 +0400 Subject: [torqueusers] load balancing In-Reply-To: <1339696419.11961.YahooMailNeo@web111712.mail.gq1.yahoo.com> References: <1339696419.11961.YahooMailNeo@web111712.mail.gq1.yahoo.com> Message-ID: <20120614232926.cc2454e4.bircoph@gmail.com> Hello, On Thu, 14 Jun 2012 10:53:39 -0700 (PDT) Mahmood Naderan wrote: > Hi > Assume there are two nodes: > w1 np=12 > w2 np=12 > > When I submit 8 jobs, all 8 jobs are run on w2. It is more desirable to have two 50% load on each node rather than a 100% load on one. The 8 jobs are all single threaded not MPI. Is there any way to define such policy? > http://www.supercluster.org/mailman/listinfo/torqueusers qmgr -c 'set server node_pack = false' should help you. Best regards, Andrew Savchenko -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120614/c3fc95f8/attachment.bin From fcaba at uns.edu.ar Thu Jun 14 13:51:57 2012 From: fcaba at uns.edu.ar (Fernando Caba) Date: Thu, 14 Jun 2012 16:51:57 -0300 Subject: [torqueusers] pbs:_mom don´t start on a node startup Message-ID: <4FDA40DD.30103@uns.edu.ar> Hi, I am running Torque 3.0.1. When a node is powered up, pbs_mom doesn't start. /var/log/messages shows this: Jun 13 11:03:29 n10 kernel: pbs_mom[4152]: segfault at 0000000000000000 rip 000000000041a7c5 rsp 00007fffd76761d0 error 4 pbs_mom always fails at startup. If I log in to the node and run 'service pbs_mom start', pbs_mom starts successfully. Does anybody have an idea? Regards -- ---------------------------------------------------- Ing. Fernando Caba Director General de Telecomunicaciones Universidad Nacional del Sur http://www.dgt.uns.edu.ar Tel/Fax: (54)-291-4595166 Tel: (54)-291-4595101 int. 2050 Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina ---------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4533 bytes Desc: Firma criptográfica S/MIME Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120614/86bd690d/attachment.bin From jwkeller at alaska.edu Thu Jun 14 15:48:56 2012 From: jwkeller at alaska.edu (John Keller) Date: Thu, 14 Jun 2012 13:48:56 -0800 Subject: [torqueusers] torque 2.5.11 - Sect 1.1.4 TORQUE as a service Message-ID: I gather from p. 12 of the Administrator's Guide that TORQUE can be installed as a service. On the server, on each node, or on all the above (in my SUSE v 12 cluster). It looks to me like on the server, suse.pbs_server and suse.pbs_sched would go into /etc/init.d/. and then be turned on with insserv commands.
And on each compute node, suse.pbs_mom would go into the same directory, and likewise be turned on by an insserv command. Is this correct? John Keller From bdandrus at nps.edu Thu Jun 14 16:18:09 2012 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Thu, 14 Jun 2012 22:18:09 +0000 Subject: [torqueusers] setting checkpoint defaults for queue Message-ID: Ok, Small bug that has been, well, bugging me for a bit. [root at hamming ~]# qmgr Max open servers: 10239 Qmgr: set queue short checkpoint_defaults = enabled Qmgr: set queue short checkpoint_defaults += shutdown qmgr obj=short svr=default: PBS server internal error checkpoint_defaults Hmmm. Ok, so I cannot use "+=" for setting the checkpoint_defaults. I'll go by the docs at http://www.adaptivecomputing.com/resources/docs/torque/3-0-5/a.bserverparameters.php#checkpoint_defaults Qmgr: set queue short checkpoint_defaults = "enabled, shutdown" That seems to work.. Qmgr: p q short # # Create queues and set their attributes. # # # Create and define queue short # create queue short set queue short queue_type = Execution set queue short checkpoint_defaults = enabled set queue short checkpoint_defaults += shutdown set queue short enabled = True set queue short started = True What? Wait... That += syntax is not valid, so it should not be showing it that way... This is irritating because I use qmgr -c 'p s' to backup my configurations. So, can we please get checkpoint_defaults to allow += OR stop it from being shown that way when I print the configuration? Thanks! Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 From gus at ldeo.columbia.edu Thu Jun 14 17:01:05 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 14 Jun 2012 19:01:05 -0400 Subject: [torqueusers] torque 2.5.11 - Sect 1.1.4 TORQUE as a service In-Reply-To: References: Message-ID: <4FDA6D31.1090104@ldeo.columbia.edu> On 06/14/2012 05:48 PM, John Keller wrote: > I gather from p. 
12 of the Administrator's Guide that TORQUE can be > installed as a service. On the server, on each node, or on all the > above (in my SUSE v 12 cluster). > It looks to me like on the server, suse.pbs_server and suse.pbs_sched > would go into /etc/init.d/. and then be turned on with insserv > commands. And on each compute node, suse.pbs_mom would go into the > same directory, and likewise be turned on by an insserv command. > Is this correct? > John Keller > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers Hi John I can't speak about SUSE, but what you say sounds very similar to what we do on RHEL/CentOS/Fedora: Put the pbs_server startup script in /etc/init.d on the Torque server, put the pbs_mom startup script in /etc/init.d on the various compute nodes. Then use chkconfig [whose SUSE counterpart, I presume, is insserv] to configure each service to start up on boot at various runlevels, respectively on the server and the compute nodes. To activate both, either reboot everything, or just start the services manually this first time with 'service pbs_server start' on the Torque server, and likewise for pbs_mom on all nodes. A point your description seems to be missing is the scheduler. You could be using pbs_sched or maui [not both]. Either way, the procedure to start them as services is similar to that used to start pbs_server on the Torque server. Make sure you create a $TORQUE/server_priv/nodes file that reflects the nodes and processors/cores_per_node of your cluster. Make sure you configure your server too, create queues, enable job scheduling, etc.
These are qmgr commands that you can gather in a text file, say torque_server_configuration, for ease of documentation, then just run them with: qmgr < torque_server_configuration I hope this helps, Gus Correa From nt_mahmood at yahoo.com Fri Jun 15 00:23:00 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Thu, 14 Jun 2012 23:23:00 -0700 (PDT) Subject: [torqueusers] load balancing In-Reply-To: <20120614232926.cc2454e4.bircoph@gmail.com> References: <1339696419.11961.YahooMailNeo@web111712.mail.gq1.yahoo.com> <20120614232926.cc2454e4.bircoph@gmail.com> Message-ID: <1339741380.5189.YahooMailNeo@web111704.mail.gq1.yahoo.com> >qmgr -c 'set server node_pack = false' Thanks for your help. I cannot find documentation about this parameter in "4.1 queue configuration" http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml // Naderan *Mahmood; ----- Original Message ----- From: Andrew Savchenko To: Mahmood Naderan ; Torque Users Mailing List Cc: Sent: Thursday, June 14, 2012 11:59 PM Subject: Re: [torqueusers] load balancing Hello, On Thu, 14 Jun 2012 10:53:39 -0700 (PDT) Mahmood Naderan wrote: > Hi > Assume there are two nodes: > w1 np=12 > w2 np=12 > > When I submit 8 jobs, all 8 jobs are run on w2. It is more desirable to have two 50% load on each node rather than a 100% load on one. The 8 jobs are all single threaded not MPI. Is there any way to define such policy? > http://www.supercluster.org/mailman/listinfo/torqueusers qmgr -c 'set server node_pack = false' should help you.
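A minimal sketch of applying the setting above and confirming that the server stored it. It assumes the commands are run by a Torque manager account on the pbs_server host; whether job placement actually changes depends on which scheduler is making the decisions, so treat this as something to try rather than a guaranteed fix.

```shell
# Sketch only: assumes qmgr is on PATH and the caller has manager rights.

# Ask pbs_server to scatter jobs across nodes instead of packing one node first.
qmgr -c 'set server node_pack = false'

# Print the server configuration and show the node_pack line, if present.
qmgr -c 'print server' | grep -i node_pack
```

If the grep prints nothing, the attribute was not stored, and the qmgr error output is the place to look.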
Best regards, Andrew Savchenko From andre.gemuend at scai.fraunhofer.de Fri Jun 15 02:56:37 2012 From: andre.gemuend at scai.fraunhofer.de (André Gemünd) Date: Fri, 15 Jun 2012 10:56:37 +0200 (CEST) Subject: [torqueusers] load balancing In-Reply-To: <1339741380.5189.YahooMailNeo@web111704.mail.gq1.yahoo.com> Message-ID: <1950665656.437475.1339750597125.JavaMail.root@scai.fraunhofer.de> This is because it's a server parameter, see here: http://www.clusterresources.com/torquedocs21/a.bserverparameters.shtml As far as I know, it only works if you use pbs_sched. Otherwise you need a similar setting in your scheduler, e.g. Maui. For the latter, it should be NODEALLOCATIONPOLICY, but afaik there is no predefined setting that is similar. You'll have to experiment with CPULOAD or PRIORITY. A good example from the Maui docs is this: Example 1: Favor the fastest nodes with the most available memory which are running the fewest jobs. NODEALLOCATIONPOLICY PRIORITY NODECFG[DEFAULT] PRIORITYF='SPEED+ .01 * AMEM - 10 * JOBCOUNT' Greetings André ----- Original Message ----- > >qmgr -c 'set server node_pack = false' > Thanks for your help. I can not find documentation about this > parameter in "4.1 queue configuration" > http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml > > > > // Naderan *Mahmood; > > > ----- Original Message ----- > From: Andrew Savchenko > To: Mahmood Naderan ; Torque Users Mailing List > > Cc: > Sent: Thursday, June 14, 2012 11:59 PM > Subject: Re: [torqueusers] load balancing > > Hello, > > On Thu, 14 Jun 2012 10:53:39 -0700 (PDT) Mahmood Naderan wrote: > > Hi > > Assume there are two nodes: > > w1 np=12 > > w2 np=12 > > > > When I submit 8 jobs, all 8 jobs are run on w2. It is more > > desirable to have two 50% load on each node rather than a 100% > > load on one. The 8 jobs are all single threaded not MPI. Is there > > any way to define such policy?
> > http://www.supercluster.org/mailman/listinfo/torqueusers > > qmgr -c 'set server node_pack = false' > should help you. > > Best regards, > Andrew Savchenko > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From nt_mahmood at yahoo.com Fri Jun 15 04:02:05 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Fri, 15 Jun 2012 03:02:05 -0700 (PDT) Subject: [torqueusers] load balancing In-Reply-To: <1950665656.437475.1339750597125.JavaMail.root@scai.fraunhofer.de> References: <1339741380.5189.YahooMailNeo@web111704.mail.gq1.yahoo.com> <1950665656.437475.1339750597125.JavaMail.root@scai.fraunhofer.de> Message-ID: <1339754525.41248.YahooMailNeo@web111715.mail.gq1.yahoo.com> >This is because its a server parameter, see here: >http://www.clusterresources.com/torquedocs21/a.bserverparameters.shtml Thanks. >As far as I know, it only works if you use pbs_sched. That is not stated in the document. Did you try it? // Naderan *Mahmood; ----- Original Message ----- From: André Gemünd To: Mahmood Naderan ; Torque Users Mailing List Cc: Sent: Friday, June 15, 2012 1:26 PM Subject: Re: [torqueusers] load balancing This is because it's a server parameter, see here: http://www.clusterresources.com/torquedocs21/a.bserverparameters.shtml As far as I know, it only works if you use pbs_sched. Otherwise you need a similar setting in your scheduler, e.g. Maui. For the latter, it should be NODEALLOCATIONPOLICY, but afaik there is no predefined setting that is similar. You'll have to experiment with CPULOAD or PRIORITY. A good example from the Maui docs is this: Example 1: Favor the fastest nodes with the most available memory which are running the fewest jobs.
NODEALLOCATIONPOLICY PRIORITY NODECFG[DEFAULT] PRIORITYF='SPEED+ .01 * AMEM - 10 * JOBCOUNT' Greetings André ----- Original Message ----- > >qmgr -c 'set server node_pack = false' > Thanks for your help. I can not find documentation about this > parameter in "4.1 queue configuration" > http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml > > > > // Naderan *Mahmood; > > > ----- Original Message ----- > From: Andrew Savchenko > To: Mahmood Naderan ; Torque Users Mailing List > > Cc: > Sent: Thursday, June 14, 2012 11:59 PM > Subject: Re: [torqueusers] load balancing > > Hello, > > On Thu, 14 Jun 2012 10:53:39 -0700 (PDT) Mahmood Naderan wrote: > > Hi > > Assume there are two nodes: > > w1 np=12 > > w2 np=12 > > > > When I submit 8 jobs, all 8 jobs are run on w2. It is more > > desirable to have two 50% load on each node rather than a 100% > > load on one. The 8 jobs are all single threaded not MPI. Is there > > any way to define such policy? > > http://www.supercluster.org/mailman/listinfo/torqueusers > > qmgr -c 'set server node_pack = false' > should help you. > > Best regards, > Andrew Savchenko > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From delphine.ramalingom at univ-reunion.fr Fri Jun 15 06:56:38 2012 From: delphine.ramalingom at univ-reunion.fr (Delphine Ramalingom) Date: Fri, 15 Jun 2012 16:56:38 +0400 Subject: [torqueusers] torque 4.0.2 In-Reply-To: References: <4FD83257.4040706@univ-reunion.fr> Message-ID: <4FDB3106.5000900@univ-reunion.fr> Dear David, I've installed torque 4.0.2, but jobs stay in the queue unless I run qrun as root. I've installed the default pbs_sched.
momctl reports that no local jobs are detected, which is wrong... Do you have any idea what the problem is? Thanks. # qstat Job id Name User Time Use S Queue ------------------------- ---------------- --------------- -------- - ----- 29.metis ExampleJob dramalin 0 Q batch 32.metis ExampleJob dramalin 0 Q batch # momctl -h metis.univ.run -d 0 Host: metis.univ.run/metis.univ.run Version: 4.0.2 PID: 2807 Server[0]: metis.univ.run (10.90.0.12:15001) Last Msg From Server: 281 seconds (DeleteJob) Last Msg To Server: 41 seconds HomeDirectory: /var/spool/torque/mom_priv MOM active: 1947 seconds LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) NOTE: no local jobs detected diagnostics complete # momctl -p 15002 -h metis.univ.run -d 3 ERROR: query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : 0 - Success) delphine On 13/06/12 20:09, David Beer wrote: > Delphine, > > This is an issue that is fixed in subsequent releases of 4.0.0. Please > download 4.0.2: > http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz > and the problem will be resolved. > > David > From dbeer at adaptivecomputing.com Fri Jun 15 09:45:42 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 15 Jun 2012 09:45:42 -0600 Subject: [torqueusers] torque 4.0.2 In-Reply-To: <4FDB3106.5000900@univ-reunion.fr> References: <4FD83257.4040706@univ-reunion.fr> <4FDB3106.5000900@univ-reunion.fr> Message-ID: I don't think that pbs_sched is the way to go for a basic setup - I recommend Maui. I think pbs_sched takes some more work before it will actually start scheduling (perhaps someone else with more experience with pbs_sched can offer some quick setup steps?) but once you get Maui talking to pbs_server it will run jobs for you. I recommend you go that way. David On Fri, Jun 15, 2012 at 6:56 AM, Delphine Ramalingom < delphine.ramalingom at univ-reunion.fr> wrote: > Dear David, > > I've installed torque 4.0.2, but job stay in queue unless I make a qrun > as root.
> I've installed the default pbs_sched. > momctl diagnoses that no local jobs detected : that's wrong... > > Have you got an idea what is the problem ? thanks. > > # qstat > Job id Name User Time Use S Queue > ------------------------- ---------------- --------------- -------- - ----- > 29.metis ExampleJob dramalin 0 Q > batch > 32.metis ExampleJob dramalin 0 Q > batch > > > # momctl -h metis.univ.run -d 0 > > Host: metis.univ.run/metis.univ.run Version: 4.0.2 PID: 2807 > Server[0]: metis.univ.run (10.90.0.12:15001) > Last Msg From Server: 281 seconds (DeleteJob) > Last Msg To Server: 41 seconds > HomeDirectory: /var/spool/torque/mom_priv > MOM active: 1947 seconds > LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) > NOTE: no local jobs detected > > diagnostics complete > > # momctl -p 15002 -h metis.univ.run -d 3 > ERROR: query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : > 0 - Success) > > delphine > > > Le 13/06/12 20:09, David Beer a ?crit : > > Delphine, > > > > This is an issue that is fixed in subsequent releases of 4.0.0. Please > > download 4.0.2: > > > http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz > > and the problem will be resolved. > > > > David > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120615/ec560a69/attachment.html From bdandrus at nps.edu Fri Jun 15 13:33:02 2012 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Fri, 15 Jun 2012 19:33:02 +0000 Subject: [torqueusers] torque 4.0.2 In-Reply-To: <4FDB3106.5000900@univ-reunion.fr> References: <4FD83257.4040706@univ-reunion.fr> <4FDB3106.5000900@univ-reunion.fr> Message-ID: Delphine, Check your queues and ensure they are enabled and started. Eg: qmgr -c 'set queue tiny enabled = True' qmgr -c 'set queue tiny started = True' Also on your jobs that all have the same $PBS_TASKNUM, you need to submit them as array jobs (eg #PBS -t 10) Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Delphine Ramalingom Sent: Friday, June 15, 2012 5:57 AM To: Torque Users Mailing List Subject: Re: [torqueusers] torque 4.0.2 Dear David, I've installed torque 4.0.2, but job stay in queue unless I make a qrun as root. I've installed the default pbs_sched. momctl diagnoses that no local jobs detected : that's wrong... Have you got an idea what is the problem ? thanks. 
# qstat Job id Name User Time Use S Queue ------------------------- ---------------- --------------- -------- - ----- 29.metis ExampleJob dramalin 0 Q batch 32.metis ExampleJob dramalin 0 Q batch # momctl -h metis.univ.run -d 0 Host: metis.univ.run/metis.univ.run Version: 4.0.2 PID: 2807 Server[0]: metis.univ.run (10.90.0.12:15001) Last Msg From Server: 281 seconds (DeleteJob) Last Msg To Server: 41 seconds HomeDirectory: /var/spool/torque/mom_priv MOM active: 1947 seconds LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) NOTE: no local jobs detected diagnostics complete # momctl -p 15002 -h metis.univ.run -d 3 ERROR: query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : 0 - Success) delphine On 13/06/12 20:09, David Beer wrote: > Delphine, > > This is an issue that is fixed in subsequent releases of 4.0.0. Please > download 4.0.2: > http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz > and the problem will be resolved. > > David > _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From jjc at iastate.edu Fri Jun 15 13:39:32 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Fri, 15 Jun 2012 19:39:32 +0000 Subject: [torqueusers] torque 4.0.2 In-Reply-To: References: <4FD83257.4040706@univ-reunion.fr> <4FDB3106.5000900@univ-reunion.fr> Message-ID: <242421BFAF465844BE24EB90BB97E2210909660C@ITSDAG1D.its.iastate.edu> Delphine, Things to check: 1) A firewall between the compute nodes and the head node that does not have the Torque ports open to the compute nodes. 2) Wrong name in /var/spool/torque/server_name 3) The cluster is on an internal 172.16 network and the head node has two Ethernet connections: a 172.16 internal IP address on eth1 for use as the cluster (named metis) and a routable IP address on eth0 for access to the outside world.
For 3, I have fixed this by using 172.16.100.1 metis external.name.iastate.edu external.name I also set the hostname to metis with /usr/bin/system-config-network The metis goes into /var/spool/torque/server_name on head nodes (metis) and on all compute nodes. James Coyle, PhD High Performance Computing Group Iowa State Univ. web: http://jjc.public.iastate.edu/ From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Friday, June 15, 2012 10:46 AM To: Torque Users Mailing List Subject: Re: [torqueusers] torque 4.0.2 I don't think that pbs_sched is the way to go for a basic setup - I recommend Maui. I think pbs_sched takes some more work before it will actually start scheduling (perhaps someone else with more experience with pbs_sched can offer some quick setup steps?) but once you get Maui talking to pbs_server it will run jobs for you. I recommend you go that way. David On Fri, Jun 15, 2012 at 6:56 AM, Delphine Ramalingom > wrote: Dear David, I've installed torque 4.0.2, but job stay in queue unless I make a qrun as root. I've installed the default pbs_sched. momctl diagnoses that no local jobs detected : that's wrong... Have you got an idea what is the problem ? thanks. 
# qstat Job id Name User Time Use S Queue ------------------------- ---------------- --------------- -------- - ----- 29.metis ExampleJob dramalin 0 Q batch 32.metis ExampleJob dramalin 0 Q batch # momctl -h metis.univ.run -d 0 Host: metis.univ.run/metis.univ.run Version: 4.0.2 PID: 2807 Server[0]: metis.univ.run (10.90.0.12:15001) Last Msg From Server: 281 seconds (DeleteJob) Last Msg To Server: 41 seconds HomeDirectory: /var/spool/torque/mom_priv MOM active: 1947 seconds LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) NOTE: no local jobs detected diagnostics complete # momctl -p 15002 -h metis.univ.run -d 3 ERROR: query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : 0 - Success) delphine Le 13/06/12 20:09, David Beer a ?crit : > Delphine, > > This is an issue that is fixed in subsequent releases of 4.0.0. Please > download 4.0.2: > http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz > and the problem will be resolved. > > David > _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120615/543e9f80/attachment.html From gus at ldeo.columbia.edu Fri Jun 15 14:19:18 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 15 Jun 2012 16:19:18 -0400 Subject: [torqueusers] torque 4.0.2 In-Reply-To: References: <4FD83257.4040706@univ-reunion.fr> <4FDB3106.5000900@univ-reunion.fr> Message-ID: <4FDB98C6.3060001@ldeo.columbia.edu> On 06/15/2012 03:33 PM, Andrus, Brian Contractor wrote: > Delphine, > > Check your queues and ensure they are enabled and started. 
Eg: > qmgr -c 'set queue tiny enabled = True' > qmgr -c 'set queue tiny started = True' > > > Also on your jobs that all have the same $PBS_TASKNUM, you need to submit them as array jobs (eg #PBS -t 10) > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > ... and to enable scheduling: qmgr -c 'set server scheduling = True' *** Can the server name on mom_priv/config be resolved by the compute nodes? Typically in /etc/hosts, and associated to your cluster private subnet. Say: mom_priv/config: $pbsserver headnode /etc/hosts: 192.168.1.1 headnode *** Did you run 'pbsnodes' to see which nodes/moms respond? Did you check the server and mom logs for possible error messages? Did you check /var/log/messages for errors? I hope this helps, Gus Correa > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Delphine Ramalingom > Sent: Friday, June 15, 2012 5:57 AM > To: Torque Users Mailing List > Subject: Re: [torqueusers] torque 4.0.2 > > Dear David, > > I've installed torque 4.0.2, but job stay in queue unless I make a qrun as root. > I've installed the default pbs_sched. > momctl diagnoses that no local jobs detected : that's wrong... > > Have you got an idea what is the problem ? thanks. 
> > # qstat > Job id Name User Time Use S Queue > ------------------------- ---------------- --------------- -------- - ----- > 29.metis ExampleJob dramalin 0 Q > batch > 32.metis ExampleJob dramalin 0 Q > batch > > > # momctl -h metis.univ.run -d 0 > > Host: metis.univ.run/metis.univ.run Version: 4.0.2 PID: 2807 > Server[0]: metis.univ.run (10.90.0.12:15001) > Last Msg From Server: 281 seconds (DeleteJob) > Last Msg To Server: 41 seconds > HomeDirectory: /var/spool/torque/mom_priv > MOM active: 1947 seconds > LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) > NOTE: no local jobs detected > > diagnostics complete > > # momctl -p 15002 -h metis.univ.run -d 3 > ERROR: query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : > 0 - Success) > > delphine > > > Le 13/06/12 20:09, David Beer a ?crit : >> Delphine, >> >> This is an issue that is fixed in subsequent releases of 4.0.0. Please >> download 4.0.2: >> http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0 >> .2.tar.gz >> and the problem will be resolved. >> >> David >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jwkeller at alaska.edu Sun Jun 17 19:19:06 2012 From: jwkeller at alaska.edu (John Keller) Date: Sun, 17 Jun 2012 17:19:06 -0800 Subject: [torqueusers] Job does not run - Torque 2.5.11 Message-ID: Hi all, OK here's a newby question: I installed Torque 2.5.11 on server chemlinux5, installed mom on another openSUSE 12.1 system chemlinux2. I just have the default batch setting. When a user attempts the test job >echo "sleep 30" | qsub , the job appears on the qstat list as "R" status, and nothing happens. 
I assume that "sleep 30" should be echoed on the terminal (although
the manual does not say what should happen). Attached is a screenshot
of the current status.
Thanks for any suggestions,
John Keller
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Keller-TorqueScreenshot.png
Type: image/png
Size: 85008 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120617/d15c920c/attachment-0001.png

From delphine.ramalingom at univ-reunion.fr Mon Jun 18 04:28:02 2012
From: delphine.ramalingom at univ-reunion.fr (Delphine Ramalingom)
Date: Mon, 18 Jun 2012 14:28:02 +0400
Subject: [torqueusers] torque 4.0.2
In-Reply-To: <4FDB98C6.3060001@ldeo.columbia.edu>
References: <4FD83257.4040706@univ-reunion.fr> <4FDB3106.5000900@univ-reunion.fr> <4FDB98C6.3060001@ldeo.columbia.edu>
Message-ID: <4FDF02B2.4070602@univ-reunion.fr>

Thanks for your suggestions.
I think the problem is that I'm on a workstation: a single machine
running all three daemons, pbs_server, pbs_mom and pbs_sched.

Delphine

From gus at ldeo.columbia.edu Mon Jun 18 08:33:21 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 18 Jun 2012 10:33:21 -0400
Subject: [torqueusers] Job does not run - Torque 2.5.11
In-Reply-To:
References:
Message-ID: <4FDF3C31.5000705@ldeo.columbia.edu>

On 06/17/2012 09:19 PM, John Keller wrote:
> Hi all,
> OK here's a newby question:
> I installed Torque 2.5.11 on server chemlinux5, installed mom on
> another openSUSE 12.1 system chemlinux2. I just have the default batch
> setting. When a user attempts the test job >echo "sleep 30" | qsub ,
> the job appears on the qstat list as "R" status, and nothing happens.
> I assume that "sleep 30" should be echo-ed on the terminal (although
> the manual does not say what should happen). Attached is a screenshot
> of current status.
> Thanks for any suggestions,
> John Keller

Hi John

Try to do something more verbose, say:

    echo "hostname;date" | qsub

to generate output text in STDOUT and perhaps STDERR
[two files $PBS_JOBID.[oe] ].
To get output on the screen, you need to launch the job interactively,
i.e. qsub -I, although 'sleep' will never say a thing!
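
For anyone hunting for those two files afterwards, here is a minimal sketch of where to look. It assumes the job was submitted without -o, -e, -j or -N, in which case qsub as I know it names the files <job_name>.o<sequence_number> and <job_name>.e<sequence_number>, with "STDIN" as the job name for a script piped in on standard input. The helper name default_output_files is made up for this example and is not a Torque command.

```shell
#!/bin/sh
# default_output_files JOBID [JOBNAME]
# Hypothetical helper: prints the default stdout/stderr file names
# for a job id such as "29.metis". Assumes no -o/-e/-j/-N was used.
default_output_files() {
    jobid=$1
    jobname=${2:-STDIN}       # job name for scripts read from stdin
    seq=${jobid%%.*}          # "29.metis" -> "29"
    printf '%s.o%s\n%s.e%s\n' "$jobname" "$seq" "$jobname" "$seq"
}

# Only talk to the batch system when qsub is actually installed.
if command -v qsub >/dev/null 2>&1; then
    jobid=$(echo "hostname; date" | qsub)
    echo "submitted $jobid; once it finishes, look for:"
    default_output_files "$jobid"
fi
```

The files normally land in the directory the job was submitted from, provided the compute node is able to copy them back there.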
I hope this helps,
Gus Correa

From kunalgrao at gmail.com Mon Jun 18 13:58:04 2012
From: kunalgrao at gmail.com (Kunal Rao)
Date: Mon, 18 Jun 2012 15:58:04 -0400
Subject: [torqueusers] Torque 3.0.3 and chroot environment
In-Reply-To:
References:
Message-ID:

Hi David,

Can you elaborate on the customizations that would be needed? What kind
of source-code modifications would be required? This might be useful, as
it would mimic a real cluster.

Thanks,
Kunal

On Thu, Jun 7, 2012 at 1:08 PM, David Beer wrote:

> Kunal,
>
> As of now each will report the same thing. If you wanted them to change
> each one, you'd have to modify the code. It wouldn't be too hard to do (the
> mom daemons know that they're running multi-mom) but it would take some
> customization.
>
> David
>
> On Thu, Jun 7, 2012 at 10:50 AM, Kunal Rao wrote:
>
>> Hi David,
>>
>> Thanks for your quick response and for pointing to the multi-mom feature.
>> The idea is similar i.e. make a small cluster look bigger with being as
>> realistic as possible.
>>
>> I read through that page and seems like it will do what I want. I had a
>> follow up question on that :
>>
>> - Does each mom read from /proc and report to the head node (pbs_server) ?
>> In that case the total cpus, memory, load etc. will be reported same
>> from each of them. Can that be isolated and different for each of them to
>> mimic actual large cluster i.e. each having different number of cpus,
>> memory, load etc.
>>
>> Thanks,
>> Kunal
>>
>> On Thu, Jun 7, 2012 at 12:16 PM, David Beer wrote:
>>
>>> Kunal,
>>>
>>> I have done a chroot environment with TORQUE - it worked fine. I was
>>> doing this for testing with sleep jobs, and the chroot was because I didn't
>>> want it to interact with anything else on the machine. I'm not sure what
>>> you're attempting to accomplish, but you may want to consider looking into
>>> the multi-mom feature (available starting in 3.0.0) that we also use a lot
>>> for testing. I have actually abandoned my chroot environment in favor of
>>> using the multi-moms.
>>>
>>> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.8multimom.php
>>>
>>> David
>>>
>>> On Thu, Jun 7, 2012 at 10:05 AM, Kunal Rao wrote:
>>>
>>>> Hi All,
>>>>
>>>> Has anyone tried chroot environment for Torque 3.0.3 or later version ?
>>>> I'm thinking of having multiple chroot environment on the same system, each
>>>> representing a compute node and build a cluster.
>>>>
>>>> So, even though there are say only 2 physical machines ( 1 server and 1
>>>> compute node), we should be able to make a cluster of say 4 nodes. Assuming
>>>> that the 1 physical compute node can have 3 chroot environment,
>>>> each having its own virtual IP and communicating with the master as 3
>>>> independent compute nodes. Head node / server will see as if there are 4
>>>> nodes and the scheduler will allocate jobs accordingly.
>>>>
>>>> Is this feasible and can work without any source code modifications to
>>>> pbs server / pbs mom ?
>>>>
>>>> Thanks,
>>>> Kunal
>
> --
> David Beer | Software Engineer
> Adaptive Computing

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120618/69841e17/attachment.html

From jwkeller at alaska.edu Tue Jun 19 14:38:26 2012
From: jwkeller at alaska.edu (John Keller)
Date: Tue, 19 Jun 2012 12:38:26 -0800
Subject: [torqueusers] Job does not run - Torque 2.5.11
In-Reply-To:
References:
Message-ID:

Hi all,
Problem solved: I noticed in the mom_log that no job activity was
recorded when the jobs were submitted on the server, so something was
blocking this traffic. I disabled the firewall on the compute
node, and now jobs are running normally.
These systems are all behind a university firewall, so it's doubtful
that a firewall on the node itself is necessary - or, I could specify
a port or ports to allow the PBS traffic.
Thank you for your suggestions.
John Keller

On Sun, Jun 17, 2012 at 5:19 PM, John Keller wrote:
> Hi all,
> OK here's a newby question:
> I installed Torque 2.5.11 on server chemlinux5, installed mom on
> another openSUSE 12.1 system chemlinux2.
I just have the default batch > setting. When a user attempts the test job >echo "sleep 30" | qsub , > the job appears on the qstat list as "R" status, and nothing happens. > I assume that "sleep 30" should be echo-ed on the terminal (although > the manual does not say what should happen). Attached is a screenshot > of current status. > Thanks for any suggestions, > John Keller From gus at ldeo.columbia.edu Tue Jun 19 15:16:11 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 19 Jun 2012 17:16:11 -0400 Subject: [torqueusers] Job does not run - Torque 2.5.11 In-Reply-To: References: Message-ID: <4FE0EC1B.6030706@ldeo.columbia.edu> On 06/19/2012 04:38 PM, John Keller wrote: > Hi all, > Problem solved: I noticed on the mom_log that no job activity was > recorded when the jobs were submitted on the server. So something was > blocking this traffic. So, I disabled the FIREEWLL on the compute > node, and now jobs are running normally. > These systems are all behind a university firewall, so its doubtful > that this is necessary - or, I could specify a port or ports to allow > the PBS traffic. > Thank you for your suggestions. > John Keller > Hi John A common setup in a cluster is to have a head node with one IP/interface/port in a public network [or in the organization private network], and another one in a private [i.e. private to the cluster] subnet. The compute nodes have IP/interface/port also in this cluster-private subnet [say, 192.168.1.0/255.255.255.0], but normally they don't have public IP [or organization/university-wide IP]. Typically a switch [or a bunch of switches] implements this cluster-private subnet. The head node can provide NAT from the compute nodes to the public side of the head node, if this is useful [say, to allow the nodes to reach a license server, etc]. Firewalls in general cause more headaches than joy in the compute nodes. 
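
If the firewall does have to stay on the compute nodes, opening only the Torque ports to the cluster-private subnet is one way to do it. The sketch below merely prints candidate iptables rules and changes nothing. It assumes the default Torque port assignments as I understand them (15001 pbs_server, 15002 pbs_mom, 15003 mom resource manager, 15004 pbs_sched), so check /etc/services or your build options first. It also opens UDP because 2.5.x moms may use it for status traffic, and it reuses the 192.168.1.0 example subnet from this message. The function name torque_fw_rules is invented for the example.

```shell
#!/bin/sh
# torque_fw_rules SUBNET
# Hypothetical helper: prints (does not apply) iptables rules that accept
# Torque traffic from SUBNET on the assumed default ports 15001-15004.
torque_fw_rules() {
    subnet=$1
    for proto in tcp udp; do
        printf 'iptables -A INPUT -p %s -s %s --dport 15001:15004 -j ACCEPT\n' \
            "$proto" "$subnet"
    done
}

# Example: show the rules for the cluster-private subnet used above.
torque_fw_rules 192.168.1.0/24
```

On openSUSE the same ports would more likely be opened through the distribution's own firewall configuration than with raw iptables commands; the rules above only show which traffic needs to get through.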
The problem you saw with the Torque moms will probably happen with MPI, NFS, etc, and you might end up with a complex set of firewall rules to open various specific ports for a number of different items. You can keep a firewall on the head node, if you want, besides the university/organization firewall. However, if your nodes are piggy-backing the university/organization network, and there is no cluster-private subnet, then you may need to check if there are university-wide policies requiring firewalls, before you remove them. If firewalls are required things may become more complicated. Otherwise, you could try the scheme above also. However, [beowulf] clusters work in a much simpler way if there is only one login/entry point: the head node [which could also be a few login nodes], with the compute nodes hidden in a cluster-private subnet. Anyway, these are guesses. I don't even know if you are trying to setup a beowulf cluster or something else. Everything may depend on how your Torque server [head node?] and your mom sisters [compute nodes] are connected to each other, and how flexible this scheme is to be adjusted to fit your goals. It also depends on what you are trying to achieve. [E.g. running parallel jobs, or running only serial jobs as a farm of nodes, allowing or not allowing direct login to the compute nodes via public/university IP, or single login point via head node, etc]. I hope this helps, Gus Correa > > On Sun, Jun 17, 2012 at 5:19 PM, John Keller wrote: >> Hi all, >> OK here's a newby question: >> I installed Torque 2.5.11 on server chemlinux5, installed mom on >> another openSUSE 12.1 system chemlinux2. I just have the default batch >> setting. When a user attempts the test job>echo "sleep 30" | qsub , >> the job appears on the qstat list as "R" status, and nothing happens. >> I assume that "sleep 30" should be echo-ed on the terminal (although >> the manual does not say what should happen). Attached is a screenshot >> of current status. 
>> Thanks for any suggestions, >> John Keller > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From knielson at adaptivecomputing.com Tue Jun 19 16:39:42 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 19 Jun 2012 16:39:42 -0600 Subject: [torqueusers] setting checkpoint defaults for queue In-Reply-To: References: Message-ID: Brian, I have created an internal ticket for this and it will be out in a future release. Ken On Thu, Jun 14, 2012 at 4:18 PM, Andrus, Brian Contractor wrote: > Ok, > > Small bug that has been, well, bugging me for a bit. > > [root at hamming ~]# qmgr > Max open servers: 10239 > Qmgr: set queue short checkpoint_defaults = enabled > Qmgr: set queue short checkpoint_defaults += shutdown > qmgr obj=short svr=default: PBS server internal error > checkpoint_defaults > > Hmmm. Ok, so I cannot use "+=" for setting the checkpoint_defaults. I'll > go by the docs at > http://www.adaptivecomputing.com/resources/docs/torque/3-0-5/a.bserverparameters.php#checkpoint_defaults > > Qmgr: set queue short checkpoint_defaults = "enabled, shutdown" > > That seems to work.. > > Qmgr: p q short > # > # Create queues and set their attributes. > # > # > # Create and define queue short > # > create queue short > set queue short queue_type = Execution > set queue short checkpoint_defaults = enabled > set queue short checkpoint_defaults += shutdown > set queue short enabled = True > set queue short started = True > What? Wait... That += syntax is not valid, so it should not be showing it > that way... > This is irritating because I use qmgr -c 'p s' to backup my configurations. > > > So, can we please get checkpoint_defaults to allow += OR stop it from > being shown that way when I print the configuration? > > Thanks! 
> > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120619/04e82ea7/attachment.html From jwkeller at alaska.edu Wed Jun 20 15:28:55 2012 From: jwkeller at alaska.edu (John Keller) Date: Wed, 20 Jun 2012 13:28:55 -0800 Subject: [torqueusers] Torque 2.5 error - key verification failed Message-ID: Hi all, I have Torque 2.5.11 installed on a server and compute node, both running SUSE 12.1. "qmgr -c 'p s' " and "pbsnodes -a " give normal output. pbs_sched is running. However when I do "qsub test.sh" where the script file contains a simple command, the output ends up in the /undelivered directory on the compute node, and "qstat -f" shows "unable to copy file....key verification failed.." Does torque require passwordless SSH or RSH to be working properly? John Keller From jwkeller at alaska.edu Wed Jun 20 23:33:56 2012 From: jwkeller at alaska.edu (John Keller) Date: Wed, 20 Jun 2012 21:33:56 -0800 Subject: [torqueusers] re Torque 2.5.11 job output stuck on compute node Message-ID: Hi All, (That) problem is solved: I just needed to read further down in the Admin Manual concerning SCP and SSH. My systems have only local file systems, which requires SSH to move files from server to compute node, and back. Also, with openSUSE 12.1, one does need to configure the SSH daemon by editing /etc/ssh/sshd.config file as described. 
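
A quick way to check this before submitting jobs is to confirm that the job owner can reach the submit host from the compute node without being prompted. This is a minimal sketch: BatchMode and ConnectTimeout are standard OpenSSH client options, while the function name and the SSH_BIN override are inventions of this example, there only so the check can be dry-run.

```shell
#!/bin/sh
# can_copy_without_password USER@HOST
# Hypothetical helper: succeeds only if ssh can log in to USER@HOST with
# no password or host-key prompt. BatchMode=yes makes ssh fail instead
# of asking. SSH_BIN is a testing hook, not an OpenSSH feature.
can_copy_without_password() {
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Usage, run on the compute node as the user who owns the job
# (the user name here is a placeholder):
#   can_copy_without_password jsmith@chemlinux5 && echo "scp should work"
```

Since the mom copies output files back to the submit host, the direction that matters is compute node to server; a host-key failure here would match the "key verification failed" message seen in qstat -f.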
John Keller

From regina.guilabert at uib.es Thu Jun 21 03:09:32 2012
From: regina.guilabert at uib.es (Regina Guilabert)
Date: Thu, 21 Jun 2012 11:09:32 +0200
Subject: [torqueusers] Problem upgrading torque 4.0.2
Message-ID:

Good morning,

We are trying to upgrade TORQUE 3.0.2 to 4.0.2. We had no problems or
errors during the compilation process, but when we run pbs_server it
shows this message:

# pbs_server
pbs_server: symbol lookup error: pbs_server: undefined symbol: netrates_mutex

Thanks in advance.
Regina
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120621/31cd9bcc/attachment.html

From pregier at ittc.ku.edu Fri Jun 22 13:14:12 2012
From: pregier at ittc.ku.edu (Phil Regier)
Date: Fri, 22 Jun 2012 14:14:12 -0500 (CDT)
Subject: [torqueusers] Sporadic UID errors
In-Reply-To: <15e6d03c-0fff-44de-abdd-1ded668337ed@zimbra.ittc.ku.edu>
Message-ID: <2154ebea-50ef-471b-ab50-a587dbaa6744@zimbra.ittc.ku.edu>

Sorry if this has been raised before (there is another LDAP thread active,
but I think the problem is very different); I'm still going through the
archives.

I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible
upgrade from 2.x and have come across some odd behaviors. In particular,
when I submit 1000 small jobs to a fake one-node cluster running Torque
3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can
retrieve specfiles etc. if that would help) and authenticated against LDAP,
I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never
get accepted); for example:

...
14289.localhost
14290.localhost
14291.localhost
qsub: Bad UID for job execution MSG=User pregier does not exist in server password file

14293.localhost
14294.localhost
14295.localhost
...

This is just a loop; there is no difference between job 14291, 14293, and
what should have been 14292.

Is this normal?
Are there precautions to avoid it, or is this a bug I
should be reporting in more detail?

Thanks for any suggestions; I'm not terribly experienced with Torque, so
I'm not sure how quickly I should be bringing this sort of thing to the
list. I can provide more details about my setup and/or stress tests, but
didn't want to dump too much useless information in my first post.

Phil Regier
Student assistant system administrator
University of Kansas, ITTC

From pregier at ittc.ku.edu Fri Jun 22 14:48:19 2012
From: pregier at ittc.ku.edu (Phil Regier)
Date: Fri, 22 Jun 2012 15:48:19 -0500 (CDT)
Subject: [torqueusers] Sporadic UID errors
In-Reply-To: <2154ebea-50ef-471b-ab50-a587dbaa6744@zimbra.ittc.ku.edu>
Message-ID: <60576d2c-aca8-4ff9-8714-f3db5ecec57d@zimbra.ittc.ku.edu>

Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying
4.0.3 snapshot now), and it should also be noted that as part of the stress
test I am constantly watching repeated qstats. The problem does not seem
to appear with 4.0.x as such; might this be related to the switch from a
single-threaded server to multi-threaded?

----- Original Message -----
From: "Phil Regier"
To: torqueusers at supercluster.org
Sent: Friday, June 22, 2012 2:14:12 PM
Subject: Sporadic UID errors

Sorry if this has been raised before (there is another LDAP thread active,
but I think the problem is very different); I'm still going through the
archives.

I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible
upgrade from 2.x and have come across some odd behaviors. In particular,
when I submit 1000 small jobs to a fake one-node cluster running Torque
3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can
retrieve specfiles etc. if that would help) and authenticated against LDAP,
I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never
get accepted); for example:

...
14289.localhost 14290.localhost 14291.localhost qsub: Bad UID for job execution MSG=User pregier does not exist in server password file 14293.localhost 14294.localhost 14295.localhost ... This is just a loop; there is no difference between job 14291, 14293, and what should have been 14292. Is this normal? Are there precautions to avoid it, or is this a bug I should be reporting in more detail? Thanks for any suggestions; I'm not terribly experienced with Torque, so I'm not sure how quickly I should be bringing this sort of thing to the list. I can provide more details about my setup and/or stress tests, but didn't want to dump too much useless information in my first post. Phil Regier Student assistant system admininstrator University of Kansas, ITTC From dbeer at adaptivecomputing.com Mon Jun 25 09:01:29 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 25 Jun 2012 09:01:29 -0600 Subject: [torqueusers] Sporadic UID errors In-Reply-To: <60576d2c-aca8-4ff9-8714-f3db5ecec57d@zimbra.ittc.ku.edu> References: <2154ebea-50ef-471b-ab50-a587dbaa6744@zimbra.ittc.ku.edu> <60576d2c-aca8-4ff9-8714-f3db5ecec57d@zimbra.ittc.ku.edu> Message-ID: Phil, We have had other customers/users that had this problem due to LDAP failing sometimes. We added a retry parameter for the moms. You can set it in the mom's config file, just add the line: $ext_pwd_retry If you don't really have users going to machines that they shouldn't go to, then you might want to set this to a fairly high number so that jobs aren't lost unnecessarily. David On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier wrote: > Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying > 4.0.3 snapshot now), and it should also be noted that as part of the stress > test I am constantly watching repeated qstats. The problem does not seem > to appear with 4.0.x as such; might this be related to the switch from a > single-threaded server to multi-threaded? 
> > ----- Original Message ----- > From: "Phil Regier" > To: torqueusers at supercluster.org > Sent: Friday, June 22, 2012 2:14:12 PM > Subject: Sporadic UID errors > > Sorry if this has been raised (there is another LDAP thread active but I > think the problem is very different) before; I'm still going through the > archives. > > I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible > upgrade from 2.x and have come across some odd behaviors. In particular, > when I submit 1000 small jobs to a fake one-node cluster running Torque > 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can > retrieve specfiles etc. if that would help) and authenticated against LDAP, > I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never > get accepted); for example: > > ... > 14289.localhost > 14290.localhost > 14291.localhost > qsub: Bad UID for job execution MSG=User pregier does not exist in server > password file > > 14293.localhost > 14294.localhost > 14295.localhost > ... > > > This is just a loop; there is no difference between job 14291, 14293, and > what should have been 14292. > > Is this normal? Are there precautions to avoid it, or is this a bug I > should be reporting in more detail? > > Thanks for any suggestions; I'm not terribly experienced with Torque, so > I'm not sure how quickly I should be bringing this sort of thing to the > list. I can provide more details about my setup and/or stress tests, but > didn't want to dump too much useless information in my first post. > > Phil Regier > Student assistant system admininstrator > University of Kansas, ITTC > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120625/cb516fe5/attachment-0001.html

From dbeer at adaptivecomputing.com Mon Jun 25 09:02:46 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 25 Jun 2012 09:02:46 -0600
Subject: [torqueusers] Problem upgrading torque 4.0.2
In-Reply-To:
References:
Message-ID:

Regina,

This looks like the library didn't get replaced properly. Can you verify
that pbs_server is picking up the correct libtorque library?

David

On Thu, Jun 21, 2012 at 3:09 AM, Regina Guilabert wrote:

> Good morning,
>
> We are trying to upgrade TORQUE 3.0.2 to 4.0.2. We had no problems or
> errors during the compilation process, but when we run pbs_server it
> shows this message:
>
> # pbs_server
> pbs_server: symbol lookup error: pbs_server: undefined symbol:
> netrates_mutex
>
> Thanks in advance.
> Regina

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120625/c433f9fe/attachment.html

From pregier at ittc.ku.edu Mon Jun 25 09:36:07 2012
From: pregier at ittc.ku.edu (Phil Regier)
Date: Mon, 25 Jun 2012 10:36:07 -0500 (CDT)
Subject: [torqueusers] Sporadic UID errors
In-Reply-To:
Message-ID:

Nice; that's pretty slick! I'm sure that will solve the problem; I'll
switch back to 3.0.5 in a bit to try it out.

Thanks!

Phil

----- Original Message -----
From: "David Beer"
To: "Torque Users Mailing List"
Sent: Monday, June 25, 2012 10:01:29 AM
Subject: Re: [torqueusers] Sporadic UID errors

Phil,

We have had other customers/users that had this problem due to LDAP
failing sometimes. We added a retry parameter for the moms.
You can set it in the mom's config file, just add the line: $ext_pwd_retry If you don't really have users going to machines that they shouldn't go to, then you might want to set this to a fairly high number so that jobs aren't lost unnecessarily. David On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier < pregier at ittc.ku.edu > wrote: Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying 4.0.3 snapshot now), and it should also be noted that as part of the stress test I am constantly watching repeated qstats. The problem does not seem to appear with 4.0.x as such; might this be related to the switch from a single-threaded server to multi-threaded? ----- Original Message ----- From: "Phil Regier" < pregier at ittc.ku.edu > To: torqueusers at supercluster.org Sent: Friday, June 22, 2012 2:14:12 PM Subject: Sporadic UID errors Sorry if this has been raised (there is another LDAP thread active but I think the problem is very different) before; I'm still going through the archives. I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible upgrade from 2.x and have come across some odd behaviors. In particular, when I submit 1000 small jobs to a fake one-node cluster running Torque 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can retrieve specfiles etc. if that would help) and authenticated against LDAP, I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never get accepted); for example: ... 14289.localhost 14290.localhost 14291.localhost qsub: Bad UID for job execution MSG=User pregier does not exist in server password file 14293.localhost 14294.localhost 14295.localhost ... This is just a loop; there is no difference between job 14291, 14293, and what should have been 14292. Is this normal? Are there precautions to avoid it, or is this a bug I should be reporting in more detail? 
Thanks for any suggestions; I'm not terribly experienced with Torque, so I'm not sure how quickly I should be bringing this sort of thing to the list. I can provide more details about my setup and/or stress tests, but didn't want to dump too much useless information in my first post. Phil Regier Student assistant system admininstrator University of Kansas, ITTC _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From dbeer at adaptivecomputing.com Mon Jun 25 09:38:47 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 25 Jun 2012 09:38:47 -0600 Subject: [torqueusers] Sporadic UID errors In-Reply-To: References: Message-ID: You don't need to switch - this fix is in 4.* as well. David On Mon, Jun 25, 2012 at 9:36 AM, Phil Regier wrote: > Nice; that's pretty slick! I'm sure that will solve the problem; I'll > switch back to 3.0.5 in a bit to try it out. > > Thanks! > > Phil > > ----- Original Message ----- > From: "David Beer" > To: "Torque Users Mailing List" > Sent: Monday, June 25, 2012 10:01:29 AM > Subject: Re: [torqueusers] Sporadic UID errors > > > Phil, > > We have had other customers/users that had this problem due to LDAP > failing sometimes. We added a retry parameter for the moms. You can set it > in the mom's config file, just add the line: > > $ext_pwd_retry > > If you don't really have users going to machines that they shouldn't go > to, then you might want to set this to a fairly high number so that jobs > aren't lost unnecessarily. > > David > > > On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier < pregier at ittc.ku.edu > > wrote: > > > Oops. 
An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying > 4.0.3 snapshot now), and it should also be noted that as part of the stress > test I am constantly watching repeated qstats. The problem does not seem to > appear with 4.0.x as such; might this be related to the switch from a > single-threaded server to multi-threaded? > > > > ----- Original Message ----- > From: "Phil Regier" < pregier at ittc.ku.edu > > To: torqueusers at supercluster.org > Sent: Friday, June 22, 2012 2:14:12 PM > Subject: Sporadic UID errors > > Sorry if this has been raised (there is another LDAP thread active but I > think the problem is very different) before; I'm still going through the > archives. > > I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible > upgrade from 2.x and have come across some odd behaviors. In particular, > when I submit 1000 small jobs to a fake one-node cluster running Torque > 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can > retrieve specfiles etc. if that would help) and authenticated against LDAP, > I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never > get accepted); for example: > > ... > 14289.localhost > 14290.localhost > 14291.localhost > qsub: Bad UID for job execution MSG=User pregier does not exist in server > password file > > 14293.localhost > 14294.localhost > 14295.localhost > ... > > > This is just a loop; there is no difference between job 14291, 14293, and > what should have been 14292. > > Is this normal? Are there precautions to avoid it, or is this a bug I > should be reporting in more detail? > > Thanks for any suggestions; I'm not terribly experienced with Torque, so > I'm not sure how quickly I should be bringing this sort of thing to the > list. I can provide more details about my setup and/or stress tests, but > didn't want to dump too much useless information in my first post. 
> > Phil Regier > Student assistant system admininstrator > University of Kansas, ITTC > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > > David Beer | Software Engineer > Adaptive Computing > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120625/c46fc90a/attachment.html From pregier at ittc.ku.edu Mon Jun 25 09:54:03 2012 From: pregier at ittc.ku.edu (Phil Regier) Date: Mon, 25 Jun 2012 10:54:03 -0500 (CDT) Subject: [torqueusers] Sporadic UID errors In-Reply-To: Message-ID: <6be335a9-545d-40ea-9113-c96c0d1a19fe@zimbra.ittc.ku.edu> I'm switching back and forth in an attempt to ascertain which upgrade would be less painful. There is a much uglier issue with 4.0.x that rears its head when I least expect it; I can't quite reproduce it consistently yet, but I'm seeing sporadic configuration database corruption (possibly related to Brian Andrus' "+=" bug from a couple weeks ago?) under heavy loads. If I can reproduce it I'll report the details here (unless I can find it in the list archives). Knowing that there is an easy solution to the only relatively major issue I've seen so far with 3.0.x is a big plus. Thanks again! PR ----- Original Message ----- From: "David Beer" To: "Torque Users Mailing List" Sent: Monday, June 25, 2012 10:38:47 AM Subject: Re: [torqueusers] Sporadic UID errors You don't need to switch - this fix is in 4.* as well. 
David On Mon, Jun 25, 2012 at 9:36 AM, Phil Regier < pregier at ittc.ku.edu > wrote: Nice; that's pretty slick! I'm sure that will solve the problem; I'll switch back to 3.0.5 in a bit to try it out. Thanks! Phil ----- Original Message ----- From: "David Beer" < dbeer at adaptivecomputing.com > To: "Torque Users Mailing List" < torqueusers at supercluster.org > Sent: Monday, June 25, 2012 10:01:29 AM Subject: Re: [torqueusers] Sporadic UID errors Phil, We have had other customers/users that had this problem due to LDAP failing sometimes. We added a retry parameter for the moms. You can set it in the mom's config file, just add the line: $ext_pwd_retry If you don't really have users going to machines that they shouldn't go to, then you might want to set this to a fairly high number so that jobs aren't lost unnecessarily. David On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier < pregier at ittc.ku.edu > wrote: Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying 4.0.3 snapshot now), and it should also be noted that as part of the stress test I am constantly watching repeated qstats. The problem does not seem to appear with 4.0.x as such; might this be related to the switch from a single-threaded server to multi-threaded? ----- Original Message ----- From: "Phil Regier" < pregier at ittc.ku.edu > To: torqueusers at supercluster.org Sent: Friday, June 22, 2012 2:14:12 PM Subject: Sporadic UID errors Sorry if this has been raised (there is another LDAP thread active but I think the problem is very different) before; I'm still going through the archives. I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible upgrade from 2.x and have come across some odd behaviors. In particular, when I submit 1000 small jobs to a fake one-node cluster running Torque 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can retrieve specfiles etc. 
if that would help) and authenticated against LDAP, I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never get accepted); for example: ... 14289.localhost 14290.localhost 14291.localhost qsub: Bad UID for job execution MSG=User pregier does not exist in server password file 14293.localhost 14294.localhost 14295.localhost ... This is just a loop; there is no difference between job 14291, 14293, and what should have been 14292. Is this normal? Are there precautions to avoid it, or is this a bug I should be reporting in more detail? Thanks for any suggestions; I'm not terribly experienced with Torque, so I'm not sure how quickly I should be bringing this sort of thing to the list. I can provide more details about my setup and/or stress tests, but didn't want to dump too much useless information in my first post. Phil Regier Student assistant system admininstrator University of Kansas, ITTC _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From jwkeller at alaska.edu Mon Jun 25 15:44:17 2012 From: jwkeller at alaska.edu (John Keller) Date: Mon, 25 Jun 2012 13:44:17 -0800 Subject: [torqueusers] How to restart pbs_mom daemon in torque 2.5.11? Message-ID: Hi, Can I restart pbs_mom on a compute node? I could not find a reference to this in the Torque admin manual. 
I made a change to the /var/spool/torque/mom_priv/config file, and so I assume pbs_mom must be restarted, or the compute node has to be rebooted. Thanks, John Keller From sm4082 at nyu.edu Mon Jun 25 16:46:51 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Mon, 25 Jun 2012 18:46:51 -0400 Subject: [torqueusers] How to restart pbs_mom daemon in torque 2.5.11? In-Reply-To: References: Message-ID: <2800472A-1F88-414D-9BCB-938478034E58@nyu.edu> Hi John, I believe it is pbs_mom -p That restarts pbs_mom, picking up the changes to the config in mom_priv without losing the jobs running on the node. I hope someone can confirm this. It should work fine with this version. The last time I tried it, with one of the older versions, all the jobs were lost. Please check the change logs to see whether that was fixed in this version or in a later one. -- Sreedhar. -- Sent from my phone. Please excuse my brevity and any typos. On Jun 25, 2012, at 17:44, John Keller wrote: > Hi, > Can I restart pbs_mom on a compute node? I could not find a reference > to this in the Torque admin manual. > I made a change to the /var/spool/torque/mom_priv/config file, and so > I assume pbs_mom must be restarted, or the compute node has to be > rebooted. > Thanks, > John Keller From rmckay at adaptivecomputing.com Mon Jun 25 17:08:02 2012 From: rmckay at adaptivecomputing.com (Rick McKay) Date: Mon, 25 Jun 2012 17:08:02 -0600 Subject: [torqueusers] How to restart pbs_mom daemon in torque 2.5.11?
In-Reply-To: <2800472A-1F88-414D-9BCB-938478034E58@nyu.edu> References: <2800472A-1F88-414D-9BCB-938478034E58@nyu.edu> Message-ID: -p is the default after 2.4.0, and 2.5.11 has this correction in the changelog: b - Fixed a problem where starting pbs_mom with the default -p option would delete running jobs instead of trying to recover them if cpusets were enabled. TRQ-828 Fixes Bugzilla 174. --Rick On Mon, Jun 25, 2012 at 4:46 PM, Sreedhar Manchu wrote: > Hi John, > > I believe it is > > pbs_mom -p > > It will restart there by picking up changes to config in mom_priv and at > the same time not losing jobs running on the node. > > Hope some one confirms this. It shoukd work fine with this version. Last > time when I tried with one of the older versions all the jobs were lost. > Please check change logs for this version to see whether it was fixed in > this version or later version. > > -- Sreedhar. > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120625/c0ec6037/attachment.html From regina.guilabert at uib.es Tue Jun 26 01:24:41 2012 From: regina.guilabert at uib.es (Regina Guilabert) Date: Tue, 26 Jun 2012 09:24:41 +0200 Subject: [torqueusers] Problem upgrading torque 4.0.2 In-Reply-To: References: Message-ID: I've tried to remove it and I have the same result. Regina 2012/6/25 David Beer > Regina, > > This looks like the library didn't get replaced properly. Can you very > that pbs_server is picking up the correct libtorque library? > > David > > On Thu, Jun 21, 2012 at 3:09 AM, Regina Guilabert > wrote: > >> Good morning, >> >> We are trying to upgrade TORQUE 3.0.2 to 4.0.2. We have no problem or >> mistakes during the compilation process, but when we run pbs_server it is >> showed this missage: >> >> # pbs_server >> pbs_server: symbol lookup error: pbs_server: undefined symbol: >> netrates_mutex >> >> Thanks in advance. 
>> Regina >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > > > -- > David Beer | Software Engineer > Adaptive Computing > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120626/0c549317/attachment.html From knielson at adaptivecomputing.com Tue Jun 26 07:26:26 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 26 Jun 2012 07:26:26 -0600 Subject: [torqueusers] Problem upgrading torque 4.0.2 In-Reply-To: References: Message-ID: On Tue, Jun 26, 2012 at 1:24 AM, Regina Guilabert wrote: > I've tried to remove it and I have the same result. > > Regina > > > 2012/6/25 David Beer > >> Regina, >> >> This looks like the library didn't get replaced properly. Can you very >> that pbs_server is picking up the correct libtorque library? >> >> David >> >> On Thu, Jun 21, 2012 at 3:09 AM, Regina Guilabert < >> regina.guilabert at uib.es> wrote: >> >>> Good morning, >>> >>> We are trying to upgrade TORQUE 3.0.2 to 4.0.2. We have no problem or >>> mistakes during the compilation process, but when we run pbs_server it is >>> showed this missage: >>> >>> # pbs_server >>> pbs_server: symbol lookup error: pbs_server: undefined symbol: >>> netrates_mutex >>> >>> Thanks in advance. >>> Regina >>> >>> >>> >>> > Regina, Did you do a search for all copies of libtorque? Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120626/ed54efe2/attachment.html From regina.guilabert at uib.es Wed Jun 27 01:23:52 2012 From: regina.guilabert at uib.es (Regina Guilabert) Date: Wed, 27 Jun 2012 09:23:52 +0200 Subject: [torqueusers] Problem upgrading torque 4.0.2 In-Reply-To: References: Message-ID: I've searched for them, but I'll try again. Regina 2012/6/26 Ken Nielson > Did you do a search for all copies of libtorque? From kphillipjack2009 at my.fit.edu Wed Jun 27 09:26:49 2012 From: kphillipjack2009 at my.fit.edu (Kareem Phillip-Jackson) Date: Wed, 27 Jun 2012 11:26:49 -0400 Subject: [torqueusers] Torque not running submitted jobs Message-ID: Every time a job is run on Torque it enters the queue and exits; then a message is sent to my email saying this: PBS Job Id: 88.cluster Job Name: myJob Exec host: node02/0 An error has occurred processing your job, see below. Post job file processing error; job 88.cluster on host node02/0Bad UID for job execution REJHOST=node02 MSG=cannot find user 'username' in password file I checked all over the internet and there doesn't seem to be an answer for it. Can someone please help me fix this problem? Any help will be appreciated! From dbeer at adaptivecomputing.com Wed Jun 27 09:51:22 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 27 Jun 2012 09:51:22 -0600 Subject: [torqueusers] Torque not running submitted jobs In-Reply-To: References: Message-ID: Is your LDAP server down?
David On Wed, Jun 27, 2012 at 9:26 AM, Kareem Phillip-Jackson < kphillipjack2009 at my.fit.edu> wrote: > Every time a job is run on torque it enters the queue and exits, then a > message is sent to my email saying this: > > PBS Job Id: 88.cluster > Job Name: myJob > Exec host: node02/0 > An error has occurred processing your job, see below. > Post job file processing error; job 88.cluster on host node02/0Bad UID for job execution REJHOST=node02 MSG=cannot find user 'username' in password file > > > I checked all over the internet and there doesn't seem to be an answer for > it, can someone please help me fix this problem. Any help will be > appreciated! > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120627/8d715d05/attachment.html From kphillipjack2009 at my.fit.edu Wed Jun 27 11:35:53 2012 From: kphillipjack2009 at my.fit.edu (Kareem Phillip-Jackson) Date: Wed, 27 Jun 2012 13:35:53 -0400 Subject: [torqueusers] Torque not running submitted jobs In-Reply-To: References: Message-ID: No, my LDAP server is up and running and I tested it with ssh and LDAP authentication! On Wed, Jun 27, 2012 at 11:51 AM, David Beer wrote: > Is your LDAP server down? > > David > > On Wed, Jun 27, 2012 at 9:26 AM, Kareem Phillip-Jackson < > kphillipjack2009 at my.fit.edu> wrote: > >> Every time a job is run on torque it enters the queue and exits, then a >> message is sent to my email saying this: >> >> PBS Job Id: 88.cluster >> Job Name: myJob >> Exec host: node02/0 >> An error has occurred processing your job, see below. 
>> Post job file processing error; job 88.cluster on host node02/0Bad UID for job execution REJHOST=node02 MSG=cannot find user 'username' in password file >> >> >> I checked all over the internet and there doesn't seem to be an answer >> for it, can someone please help me fix this problem. Any help will be >> appreciated! > -- > David Beer | Software Engineer > Adaptive Computing From dbeer at adaptivecomputing.com Wed Jun 27 11:42:42 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 27 Jun 2012 11:42:42 -0600 Subject: [torqueusers] Torque not running submitted jobs In-Reply-To: References: Message-ID: The error you're getting means that the moms are unable to verify that the user is authorized on that host. Other users have sometimes had to restart the moms after they fixed LDAP, which you may or may not need to do, but that error tells you that you need to be looking at what is happening there. David On Wed, Jun 27, 2012 at 11:35 AM, Kareem Phillip-Jackson < kphillipjack2009 at my.fit.edu> wrote: > No, my LDAP server is up and running and I tested it with ssh and LDAP > authentication! > > > On Wed, Jun 27, 2012 at 11:51 AM, David Beer wrote: > >> Is your LDAP server down?
>> >> David >> >> On Wed, Jun 27, 2012 at 9:26 AM, Kareem Phillip-Jackson < >> kphillipjack2009 at my.fit.edu> wrote: >> >>> Every time a job is run on torque it enters the queue and exits, then a >>> message is sent to my email saying this: >>> >>> PBS Job Id: 88.cluster >>> Job Name: myJob >>> Exec host: node02/0 >>> An error has occurred processing your job, see below. >>> Post job file processing error; job 88.cluster on host node02/0Bad UID for job execution REJHOST=node02 MSG=cannot find user 'username' in password file >>> >>> >>> I checked all over the internet and there doesn't seem to be an answer >>> for it, can someone please help me fix this problem. Any help will be >>> appreciated! >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> >> >> >> -- >> David Beer | Software Engineer >> Adaptive Computing >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120627/fc8666a4/attachment.html From kphillipjack2009 at my.fit.edu Wed Jun 27 12:04:35 2012 From: kphillipjack2009 at my.fit.edu (Kareem Phillip-Jackson) Date: Wed, 27 Jun 2012 14:04:35 -0400 Subject: [torqueusers] Torque not running submitted jobs In-Reply-To: References: Message-ID: The LDAP server that I am using is not directly hosted on the cluster that I am running torque on, just the authentication! 
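Since the error comes from the mom on node02, one way to narrow this down is to repeat the user lookup on that execution host rather than on the head node. What follows is only a minimal sketch, assuming the node has getent available; the helper name user_resolves is made up for illustration, and 'username' stands in for the real account, as in the error message.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): succeeds if the account resolves
# through the node's configured name services (files, LDAP, ...).
user_resolves() {
    getent passwd "$1" >/dev/null 2>&1
}

# Run this on the compute node that rejected the job (REJHOST), not on the
# head node. 'username' is a placeholder for the real account name.
if user_resolves "${1:-username}"; then
    echo "user resolves on $(uname -n)"
else
    echo "user does NOT resolve on $(uname -n); check the node's LDAP client setup"
fi
```

If the lookup fails on the node while ssh logins work elsewhere, the problem is more likely the node's own name-service configuration than the LDAP server itself.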
On Wed, Jun 27, 2012 at 1:42 PM, David Beer wrote: > The error you're getting is because the moms are unable to verify that > they user is authorized on that host. Other users have sometimes had to > restart the moms after they fixed LDAP, which you may or may not need to > do, but that error tells you that you need to be looking at what is > happening there. > > David > > > On Wed, Jun 27, 2012 at 11:35 AM, Kareem Phillip-Jackson < > kphillipjack2009 at my.fit.edu> wrote: > >> No, my LDAP server is up and running and I tested it with ssh and LDAP >> authentication! >> >> >> On Wed, Jun 27, 2012 at 11:51 AM, David Beer > > wrote: >> >>> Is your LDAP server down? >>> >>> David >>> >>> On Wed, Jun 27, 2012 at 9:26 AM, Kareem Phillip-Jackson < >>> kphillipjack2009 at my.fit.edu> wrote: >>> >>>> Every time a job is run on torque it enters the queue and exits, then a >>>> message is sent to my email saying this: >>>> >>>> PBS Job Id: 88.cluster >>>> Job Name: myJob >>>> Exec host: node02/0 >>>> An error has occurred processing your job, see below. >>>> Post job file processing error; job 88.cluster on host node02/0Bad UID for job execution REJHOST=node02 MSG=cannot find user 'username' in password file >>>> >>>> >>>> I checked all over the internet and there doesn't seem to be an answer >>>> for it, can someone please help me fix this problem. Any help will be >>>> appreciated! 
>>> -- >>> David Beer | Software Engineer >>> Adaptive Computing > -- > David Beer | Software Engineer > Adaptive Computing From jwkeller at alaska.edu Wed Jun 27 13:28:41 2012 From: jwkeller at alaska.edu (John Keller) Date: Wed, 27 Jun 2012 11:28:41 -0800 Subject: [torqueusers] Torque not running submitted jobs In-Reply-To: References: Message-ID: David, How do you restart pbs_mom? (With torque 2.5.11, pbs_mom -p returns an error...) John Keller On Wed, Jun 27, 2012 at 9:42 AM, David Beer wrote: > The error you're getting is because the moms are unable to verify that they > user is authorized on that host. Other users have sometimes had to restart > the moms after they fixed LDAP, which you may or may not need to do, but > that error tells you that you need to be looking at what is happening there. > > David > > > On Wed, Jun 27, 2012 at 11:35 AM, Kareem Phillip-Jackson > wrote: >> >> No, my LDAP server is up and running and I tested it with ssh and LDAP >> authentication!
>> >> >> On Wed, Jun 27, 2012 at 11:51 AM, David Beer >> wrote: >>> >>> Is your LDAP server down? >>> >>> David >>> >>> On Wed, Jun 27, 2012 at 9:26 AM, Kareem Phillip-Jackson >>> wrote: >>>> >>>> Every time a job is run on torque it enters the queue and exits, then a >>>> message is sent to my email saying this: >>>> >>>> PBS Job Id: 88.cluster >>>> Job Name: myJob >>>> Exec host: node02/0 >>>> An error has occurred processing your job, see below. >>>> Post job file processing error; job 88.cluster on host node02/0Bad UID >>>> for job execution REJHOST=node02 MSG=cannot find user 'username' in password >>>> file >>>> >>>> >>>> I checked all over the internet and there doesn't seem to be an answer >>>> for it, can someone please help me fix this problem. Any help will be >>>> appreciated! >>>> >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>> >>> >>> >>> >>> -- >>> David Beer | Software Engineer >>> Adaptive Computing >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > > > -- > David Beer | Software Engineer > Adaptive Computing > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From dbeer at adaptivecomputing.com Wed Jun 27 14:06:54 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 27 Jun 2012 14:06:54 -0600 Subject: [torqueusers] Torque not running submitted jobs In-Reply-To: References: Message-ID: On Wed, Jun 27, 2012 at 1:28 PM, John Keller wrote: > David, > How do you 
restart pbs_mom? (with torque 5.2.11, pbs_mom -p returns an > error...) > John Keller > > momctl -s shuts it down, and then pbs_mom restarts it. David > On Wed, Jun 27, 2012 at 9:42 AM, David Beer > wrote: > > The error you're getting is because the moms are unable to verify that > they > > user is authorized on that host. Other users have sometimes had to > restart > > the moms after they fixed LDAP, which you may or may not need to do, but > > that error tells you that you need to be looking at what is happening > there. > > > > David > > > > > > On Wed, Jun 27, 2012 at 11:35 AM, Kareem Phillip-Jackson > > wrote: > >> > >> No, my LDAP server is up and running and I tested it with ssh and LDAP > >> authentication! > >> > >> > >> On Wed, Jun 27, 2012 at 11:51 AM, David Beer < > dbeer at adaptivecomputing.com> > >> wrote: > >>> > >>> Is your LDAP server down? > >>> > >>> David > >>> > >>> On Wed, Jun 27, 2012 at 9:26 AM, Kareem Phillip-Jackson > >>> wrote: > >>>> > >>>> Every time a job is run on torque it enters the queue and exits, then > a > >>>> message is sent to my email saying this: > >>>> > >>>> PBS Job Id: 88.cluster > >>>> Job Name: myJob > >>>> Exec host: node02/0 > >>>> An error has occurred processing your job, see below. > >>>> Post job file processing error; job 88.cluster on host node02/0Bad UID > >>>> for job execution REJHOST=node02 MSG=cannot find user 'username' in > password > >>>> file > >>>> > >>>> > >>>> I checked all over the internet and there doesn't seem to be an answer > >>>> for it, can someone please help me fix this problem. Any help will be > >>>> appreciated! 
> >>>> > >>>> _______________________________________________ > >>>> torqueusers mailing list > >>>> torqueusers at supercluster.org > >>>> http://www.supercluster.org/mailman/listinfo/torqueusers > >>>> > >>> > >>> > >>> > >>> -- > >>> David Beer | Software Engineer > >>> Adaptive Computing > >>> > >>> > >>> _______________________________________________ > >>> torqueusers mailing list > >>> torqueusers at supercluster.org > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > >>> > >> > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > > > > > > > > -- > > David Beer | Software Engineer > > Adaptive Computing > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120627/b2245b39/attachment.html From Gareth.Williams at csiro.au Wed Jun 27 16:01:15 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Thu, 28 Jun 2012 08:01:15 +1000 Subject: [torqueusers] How to restart pbs_mom daemon in torque 2.5.11? In-Reply-To: References: Message-ID: <007DECE986B47F4EABF823C1FBB19C62010529679F22@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: John Keller [mailto:jwkeller at alaska.edu] > Sent: Tuesday, 26 June 2012 7:44 AM > To: Torque Users Mailing List > Subject: [torqueusers] How to restart pbs_mom daemon in torque 2.5.11? > > Hi, > Can I restart pbs_mom on a compute node? 
I could not find a reference > to this in the Torque admin manual. > I made a change to the /var/spool/torque/mom_priv/config file, and so > I assume pbs_mom must be restarted, or the compute node has to be > rebooted. > Thanks, > John Keller Hi John, You can use momctl to make pbs_mom read a config file (-r option) without a restart. Gareth

> momctl
no command specified
USAGE: momctl
  [ -c {JOB|'all'} ]        // CLEAR STALE JOB
  [ -C ]                    // CYCLE
  [ -d DIAGLEVEL ]          // DIAGNOSE (0 - 3)
  [ -f HOSTFILE ]           // FILE CONTAINING HOSTLIST
  [ -h HOST[,HOST]... ]     // HOSTLIST
  [ -p PORT ]               // PORT
  [ -q ATTR ]               // QUERY ATTRIBUTE
  [ -r FILE ]               // RECONFIG
  [ -s ]                    // SHUTDOWN

Only one of c, C, d, q, r, or s must be specified, but -q may be used multiple times. HOST may be a hostname or ":property".

From sm4082 at nyu.edu Wed Jun 27 16:15:10 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Wed, 27 Jun 2012 18:15:10 -0400 Subject: [torqueusers] How to restart pbs_mom daemon in torque 2.5.11? In-Reply-To: <007DECE986B47F4EABF823C1FBB19C62010529679F22@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C62010529679F22@exvic-mbx04.nexus.csiro.au> Message-ID: <42550F55-F6D6-4B2F-A523-B805E98D2064@nyu.edu> Hi Gareth, If I add something to pbs_environment, is there a way to force torque to pick up the changes without restarting pbs_mom? Thanks, Sreedhar. -- Sent from my phone. Please excuse my brevity and any typos. On Jun 27, 2012, at 18:01, wrote: >> -----Original Message----- >> From: John Keller [mailto:jwkeller at alaska.edu] >> Sent: Tuesday, 26 June 2012 7:44 AM >> To: Torque Users Mailing List >> Subject: [torqueusers] How to restart pbs_mom daemon in torque 2.5.11? >> >> Hi, >> Can I restart pbs_mom on a compute node? I could not find a reference >> to this in the Torque admin manual. >> I made a change to the /var/spool/torque/mom_priv/config file, and so >> I assume pbs_mom must be restarted, or the compute node has to be >> rebooted.
>> Thanks, >> John Keller > > Hi John, > > You can use momctl to make pbs_mom read a config file (-r option) without a restart. > > Gareth > >> momctl > no command specified > USAGE: momctl > [ -c {JOB|'all'} ] // CLEAR STALE JOB > [ -C ] // CYCLE > [ -d DIAGLEVEL ] // DIAGNOSE (0 - 3) > [ -f HOSTFILE ] // FILE CONTAINING HOSTLIST > [ -h HOST[,HOST]... ] // HOSTLIST > [ -p PORT ] // PORT > [ -q ATTR ] // QUERY ATTRIBUTE > [ -r FILE ] // RECONFIG > [ -s ] // SHUTDOWN > > Only one of c, C, d, q, r, or s must be specified, but -q may > be used multiple times. HOST may be a hostname or ":property". > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From siegert at sfu.ca Wed Jun 27 19:38:31 2012 From: siegert at sfu.ca (Martin Siegert) Date: Wed, 27 Jun 2012 18:38:31 -0700 Subject: [torqueusers] pbs_mom connections to 127.0.0.1:15001 Message-ID: <20120628013831.GC9263@stikine.sfu.ca> Hi, with torque-4.0.2 (not with 2.5.11) I see a huge number of log entries (every 45s) in /var/log/messages on all computenodes: Jun 27 18:31:15 b414 pbs_mom: LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 127.0.0.1:15001] Jun 27 18:31:15 b414 pbs_mom: LOG_ERROR::mom_server_update_stat, Cannot get a valid stream to send update to server 'localhost' Why would the mom try to contact a server 'localhost'? How can I get rid of this? Cheers, Martin -- Martin Siegert Simon Fraser University Burnaby, British Columbia Canada From Gareth.Williams at csiro.au Wed Jun 27 23:59:58 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Thu, 28 Jun 2012 15:59:58 +1000 Subject: [torqueusers] How to restart pbs_mom daemon in torque 2.5.11? 
In-Reply-To: <42550F55-F6D6-4B2F-A523-B805E98D2064@nyu.edu> References: <007DECE986B47F4EABF823C1FBB19C62010529679F22@exvic-mbx04.nexus.csiro.au> <42550F55-F6D6-4B2F-A523-B805E98D2064@nyu.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C62010529679F2B@exvic-mbx04.nexus.csiro.au> > Hi Gareth, > > If I add something to pbs_environment, is there a way to force torque > to pick up the changes without restarting pbs_mom? > > Thanks, > Sreedhar. Sorry, I do not know when that might get re-read. The documentation does not seem helpful in this case. It may be re-read when each job starts, in which case no extra action would be needed. You could look at the source code if all else fails, or else try making changes and see if they take. Gareth From delphine.ramalingom at univ-reunion.fr Thu Jun 28 06:29:58 2012 From: delphine.ramalingom at univ-reunion.fr (Delphine Ramalingom) Date: Thu, 28 Jun 2012 16:29:58 +0400 Subject: [torqueusers] job submission torque 4.0.2 Message-ID: <4FEC4E46.8000309@univ-reunion.fr> Dear all, I have a question about the syntax to submit serial jobs with torque 4.0.2. I used to write in my script: #PBS -l select=1:ncpus=1:mem=1gb but I've noticed that my job stays in the queue and checkjob tells me: job is deferred. Reason: NoResources I tried: #PBS -l nodes=1:ppn=1,mem=1gb and my job runs. What is the correct syntax? Thanks a lot. Delphine From dbeer at adaptivecomputing.com Thu Jun 28 09:21:17 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 28 Jun 2012 09:21:17 -0600 Subject: [torqueusers] pbs_mom connections to 127.0.0.1:15001 In-Reply-To: <20120628013831.GC9263@stikine.sfu.ca> References: <20120628013831.GC9263@stikine.sfu.ca> Message-ID: Martin, Dumb question, but have you already checked the server_name file, as well as the mom's config file?
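In case it helps, here is a minimal sketch of how both files could be inspected on a compute node. It assumes the default TORQUE home of /var/spool/torque (your install may use a different prefix), and the function name show_mom_server_settings is made up for illustration; it is not part of TORQUE.

```shell
#!/bin/sh
# Sketch only: print which server a MOM is configured to talk to.
# Assumption: TORQUE home is /var/spool/torque unless a path is passed as $1.
# show_mom_server_settings is a hypothetical helper, not a TORQUE command.
show_mom_server_settings() {
    torque_home="${1:-/var/spool/torque}"

    # server_name holds the default pbs_server host name.
    if [ -r "$torque_home/server_name" ]; then
        printf 'server_name: %s\n' "$(cat "$torque_home/server_name")"
    else
        printf 'server_name: (not readable)\n'
    fi

    # Print every active "$directive value" line from the MOM config,
    # skipping comments and blank lines.
    if [ -r "$torque_home/mom_priv/config" ]; then
        grep -E '^[[:space:]]*\$' "$torque_home/mom_priv/config" \
            || printf 'mom_priv/config: no directives found\n'
    else
        printf 'mom_priv/config: (not readable)\n'
    fi
}
```

Calling show_mom_server_settings with no argument (probably as root, since mom_priv is usually not world-readable) lists the server name next to every $-directive in the MOM config, so any entry that points at an unexpected host should stand out.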
David On Wed, Jun 27, 2012 at 7:38 PM, Martin Siegert wrote: > Hi, > > with torque-4.0.2 (not with 2.5.11) I see a huge number of log entries > (every 45s) in /var/log/messages on all computenodes: > > Jun 27 18:31:15 b414 pbs_mom: LOG_ERROR::Connection refused (111) in > tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() > failed [rc = 15096] [addr = 127.0.0.1:15001] > Jun 27 18:31:15 b414 pbs_mom: LOG_ERROR::mom_server_update_stat, Cannot > get a valid stream to send update to server 'localhost' > > Why would the mom try to contact a server 'localhost'? > How can I get rid of this? > > Cheers, > Martin > > -- > Martin Siegert > Simon Fraser University > Burnaby, British Columbia > Canada > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120628/22f40a2d/attachment.html From michael.lackner at unileoben.ac.at Thu Jun 28 01:11:41 2012 From: michael.lackner at unileoben.ac.at (Michael Lackner) Date: Thu, 28 Jun 2012 09:11:41 +0200 Subject: [torqueusers] Job submisson: requesting vmem/mem resource In-Reply-To: <4FE2CD93.8090607@unileoben.ac.at> References: <4FE2CD93.8090607@unileoben.ac.at> Message-ID: <4FEC03AD.2020605@unileoben.ac.at> Greetings! I guess this stuff may have been discussed already, but I just couldn't find the answer for my problem anywhere in the documentation or on the web, so maybe somebody here knows or can point me in the right direction.. I am trying to set up torque with different kinds of nodes and so far everything is running quite fine. 
The nodes are: * 4 machines with CentOS 5.8 32-Bit, 3GB available RAM * 1 machine with CentOS 5.8 64-Bit, 16GB available RAM * 2 machines with CentOS 5.8 64-Bit, 72GB available RAM Software: * torque-2.1.9-1.el5.kb * maui-3.3-4.el5 Now I have defined an arch resource, so users can specifically choose to let their job run on a 32-Bit or 64-Bit machine when requesting an arch using qsub/xpbs. Works perfectly fine. But what I would also want to do is allow users to request mem or vmem as a resource, so smaller 64-Bit jobs go to the 16GB machine and bigger ones to the large 72GB machines, as I always want torque to choose the weakest or smallest machine that can still do a certain job first and only send a job to a larger machine if really necessary. So in my job script I would have something like: #PBS -l mem=6gb or: #PBS -l vmem=6gb Now the weird thing is, it works as long as the requested memory is below the available memory of the smallest nodes with 3GB. So say I do mem=2gb, it works fine. But as soon as I request, say, mem=6gb, the job always fails and I get nothing on stdout/stderr. I've also tried to set a default vmem of 72GB on the queue and then request vmem=6gb, but the job still got terminated. I also read in the documentation that one needs to request at least nodes=1 on Linux so that mem/vmem may be requested successfully, so I tried that too, to no avail: #PBS -l mem=6gb,nodes=1 or: #PBS -l vmem=6gb,nodes=1 I checked mom_logs, but it doesn't tell me anything helpful (I replaced our hostname in the log with ""), this is from the node that was supposed to execute the job: == 06/28/2012 08:52:10;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry 06/28/2012 08:52:10;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters 06/28/2012 08:52:10;0008; pbs_mom;Job;154.;Job Modified at request of PBS_Server@ == server_logs don't say anything.
The maui log only talks about starting, but not about terminating the job, no errors or warnings to be seen. Maybe somebody could advise on how to locate the error I'm making here? Thanks a lot! -- Michael Lackner Lehrstuhl für Informationstechnologie (CiT) Montanuniversität Leoben Tel.: +43 (0)3842/402-1505 | Mail: michael.lackner at unileoben.ac.at Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.at/infotech From d-ulrick at comcast.net Thu Jun 28 08:44:29 2012 From: d-ulrick at comcast.net (Dave Ulrick) Date: Thu, 28 Jun 2012 09:44:29 -0500 (CDT) Subject: [torqueusers] How to upgrade 3.x -> 4.x Message-ID: Hi, I'm sysadmin for a new Linux-based HPC cluster that consists of a login node, two storage nodes, and 60 compute nodes. The hardware vendor preinstalled it with TORQUE 3.0.4 with Moab 6.1.5 as the scheduler. I've been tasked with upgrading TORQUE to a newer release, hopefully 4.0.2. Also, I've been asked to install the TORQUE PAM module on the compute nodes in order to prevent non-root users from logging in interactively to a node unless they have a reservation to that node. At a high level the upgrade path makes sense, but as for the details I'm a bit at sea. I have 10+ years experience with Linux and 20+ years as an IBM mainframe sysprog but I'm brand-new to HPC as well as TORQUE and Moab. Therefore, I'm hoping you folks can offer some feedback on how I might most easily accomplish the upgrade. Right now TORQUE 3.0.4 and Moab 6.1.5 are running on the first storage node with pbs_mom (also TORQUE 3.0.4) running on all 60 compute nodes. To convert to TORQUE 4.0.2, I've cooked up this plan: 1. Clone the Moab directory tree to prepare for running a test Moab on storage node 1. The new Moab will be connected to the new TORQUE. 2. Modify the production TORQUE and Moab to use only 58 of the compute nodes. The remaining 2 nodes will be used by the new TORQUE. 3. Install a TORQUE 4.0.2 pbs_server on storage node 2.
Configure this server to use the 2 compute nodes. 4. Install the TORQUE 4.0.2 pbs_mom and trqauthd services and PAM module on the 2 compute nodes. 5. Modify the configuration under /var/spool/torque on the 2 compute nodes to point to storage node 2. 6. Modify the configuration under the test Moab directory tree to listen on an alternate port and use the 2 compute nodes. 7. Install the new TORQUE commands to an alternate directory on the login node. 8. At this point, I should have the production TORQUE and Moab continuing to manage 58 of the compute nodes, with the new TORQUE and test Moab managing the other 2 nodes. 9. Test the new TORQUE in conjunction with the test Moab, most likely from a 'bash' shell whose environment includes the new TORQUE binaries in the PATH and points to the alternate Moab server via the appropriate environment variable. Assuming that testing confirms that I have a stable configuration for the new TORQUE version, I'd then take over the whole cluster for a few hours and do the following: 1. Shut down both production and test Moab and TORQUE (pbs_server and all pbs_mom services). 2. On the login node, storage node 1, and 58 compute nodes, move the old TORQUE binaries, etc., to an alternate directory and copy in the new TORQUE files in their place. With this done, all 60 compute nodes should have identical TORQUE directory trees. 3. Configure the production Moab and TORQUE to manage all 60 nodes. 4. Start production pbs_server and Moab on storage node 1 plus PAM module and all needed TORQUE services on the 58 compute nodes. 5. At this point all nodes should be using TORQUE 4.0.2. 6. Test before returning the cluster to the users. FYI, I'm also contemplating an upgrade to Moab 7.0.2, but if at all possible I'd rather do that as a separate phase. 
If I understand correctly, Moab mostly runs on a single server (storage node 1 on our cluster) with end-user binaries copied over to the login node, so after I get some much-needed Moab training I'm hoping that this would be a fairly straightforward upgrade. Does this seem to be a workable plan? Could it be streamlined? Frankly, I'm not particularly thrilled to have to do this kind of an upgrade so early in my TORQUE experience, so if there's any possible way to reduce the pain and suffering involved in the upgrade I'm all ears!!! Thanks, Dave -- Dave Ulrick d-ulrick at comcast.net From rsvancara at wsu.edu Thu Jun 28 09:46:48 2012 From: rsvancara at wsu.edu (Svancara, Randall) Date: Thu, 28 Jun 2012 15:46:48 +0000 Subject: [torqueusers] How to upgrade 3.x -> 4.x In-Reply-To: References: Message-ID: <1F880D7A2494B346B5AB96481EAE704A1C4B9D@EXMB-03.ad.wsu.edu> Hi, If I understand your process correctly, everything looks good in theory. I am wondering how the Moab license will react to moving it around to different storage nodes? I have always done inplace upgrades where I schedule downtime with the users to perform the upgrade and testing. So I will be interested in knowing how this turns out for you. Thanks, Randall ________________________________________ From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Dave Ulrick [d-ulrick at comcast.net] Sent: Thursday, June 28, 2012 7:44 AM To: Torque Users Mailing List Subject: [torqueusers] How to upgrade 3.x -> 4.x Hi, I'm sysadmin for a new Linux-based HPC cluster that consists of login node, two storage nodes, and 60 compute nodes. The hardware vendor preinstalled it with TORQUE 3.0.4 with Moab 6.1.5 as the scheduler. I've been tasked with upgrading TORQUE to a newer release, hopefully 4.0.2. 
Thanks, Dave -- Dave Ulrick d-ulrick at comcast.net _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From d-ulrick at comcast.net Thu Jun 28 10:07:06 2012 From: d-ulrick at comcast.net (Dave Ulrick) Date: Thu, 28 Jun 2012 11:07:06 -0500 (CDT) Subject: [torqueusers] How to upgrade 3.x -> 4.x In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A1C4B9D@EXMB-03.ad.wsu.edu> References: <1F880D7A2494B346B5AB96481EAE704A1C4B9D@EXMB-03.ad.wsu.edu> Message-ID: On Thu, 28 Jun 2012, Svancara, Randall wrote: > Hi, > > If I understand your process correctly, everything looks good in theory. > I am wondering how the Moab license will react to moving it around to > different storage nodes? I have always done inplace upgrades where I > schedule downtime with the users to perform the upgrade and testing. > So I will be interested in knowing how this turns out for you. Good point regarding the Moab license! I suppose I could get around any possible issue by running the new TORQUE pbs_server on the same storage node but listening on a different port. Just before I saw your reply I raised the possibility of an in-place upgrade with my boss. Technically I can see how an in-place upgrade would be much simpler. Since this will be my first upgrade, I can't guarantee that I'll get everything done in X hours or even a single day, so I'd need to ask for more time. Our cluster is so new that not all users are in the habit of using Moab+TORQUE, so perhaps I might be able to convince everyone to do without Moab+TORQUE for a few days. I'd expect that future upgrades would go more quickly so I could just schedule a few hours on a single day for those. 
Dave -- Dave Ulrick d-ulrick at comcast.net From gabe at msi.umn.edu Thu Jun 28 10:16:29 2012 From: gabe at msi.umn.edu (Gabe Turner) Date: Thu, 28 Jun 2012 11:16:29 -0500 Subject: [torqueusers] How to upgrade 3.x -> 4.x In-Reply-To: References: <1F880D7A2494B346B5AB96481EAE704A1C4B9D@EXMB-03.ad.wsu.edu> Message-ID: <20120628161629.GC13882@fog.msi.umn.edu> On Thu, Jun 28, 2012 at 11:07:06AM -0500, Dave Ulrick wrote: > On Thu, 28 Jun 2012, Svancara, Randall wrote: > > > Hi, > > > > If I understand your process correctly, everything looks good in theory. > > I am wondering how the Moab license will react to moving it around to > > different storage nodes? I have always done inplace upgrades where I > > schedule downtime with the users to perform the upgrade and testing. > > So I will be interested in knowing how this turns out for you. > > Good point regarding the Moab license! I suppose I could get around any > possible issue by running the new TORQUE pbs_server on the same storage > node but listening on a different port. > The only restriction for the Moab license is that it's nodelocked to the network node name (output of 'hostname -n') of the system on which the Moab server process has to run. If you need to move the Moab server from one host to another, you may need to have Adaptive generate a new license file based on the name of the new server host. 
HTH, Gabe -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu From gabe at msi.umn.edu Thu Jun 28 10:21:47 2012 From: gabe at msi.umn.edu (Gabe Turner) Date: Thu, 28 Jun 2012 11:21:47 -0500 Subject: [torqueusers] How to upgrade 3.x -> 4.x In-Reply-To: <20120628161629.GC13882@fog.msi.umn.edu> References: <1F880D7A2494B346B5AB96481EAE704A1C4B9D@EXMB-03.ad.wsu.edu> <20120628161629.GC13882@fog.msi.umn.edu> Message-ID: <20120628162147.GD13882@fog.msi.umn.edu> On Thu, Jun 28, 2012 at 11:16:29AM -0500, Gabe Turner wrote: > The only restriction for the Moab license is that it's nodelocked to the > network node name (output of 'hostname -n') of the system on which the Moab > server process has to run. If you need to move the Moab server from one > host to another, you may need to have Adaptive generate a new license file > based on the name of the new server host. > Sorry, not 'hostname -n', but 'uname -n'. *drinks more coffee* -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu From siegert at sfu.ca Thu Jun 28 12:01:31 2012 From: siegert at sfu.ca (Martin Siegert) Date: Thu, 28 Jun 2012 11:01:31 -0700 Subject: [torqueusers] pbs_mom connections to 127.0.0.1:15001 In-Reply-To: References: <20120628013831.GC9263@stikine.sfu.ca> Message-ID: <20120628180131.GA19610@stikine.sfu.ca> Hi David, not a dumb question at all ... I was absolutely sure that I had $pbsserver and server_name set up correctly, but I checked mom_priv/config nevertheless and, yes, the $pbsserver line is correct. But there was something else: $clienthost localhost This line must have been there ever since we started using torque and was probably put there just because we copied the initial configuration from somebody else. Anyway, I did not even know what $clienthost means.
Thus I checked the torque-4.0.1 documentation and voila: $clienthost is marked as deprecated - use $pbsserver instead. Apparently in earlier versions specifying $clienthost AND $pbsserver did not matter much, but in 4.0.x it does. I just removed all $clienthost lines from all compute nodes and the error messages are gone. Thanks! Cheers, Martin On Thu, Jun 28, 2012 at 09:21:17AM -0600, David Beer wrote: > > Martin, > Dumb question, but have you already checked the server_name file, as > well as the mom's config file? > David > > On Wed, Jun 27, 2012 at 7:38 PM, Martin Siegert <siegert at sfu.ca> > wrote: > > Hi, > with torque-4.0.2 (not with 2.5.11) I see a huge number of log > entries > (every 45s) in /var/log/messages on all computenodes: > Jun 27 18:31:15 b414 pbs_mom: LOG_ERROR::Connection refused (111) in > tcp_connect_sockaddr, Failed when trying to open tcp connection - > connect() failed [rc = 15096] [addr = 127.0.0.1:15001] > Jun 27 18:31:15 b414 pbs_mom: LOG_ERROR::mom_server_update_stat, > Cannot get a valid stream to send update to server 'localhost' > Why would the mom try to contact a server 'localhost'? > How can I get rid of this? > Cheers, > Martin > -- > Martin Siegert > Simon Fraser University > Burnaby, British Columbia > Canada > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- > > David Beer | Software Engineer > > Adaptive Computing From gus at ldeo.columbia.edu Thu Jun 28 12:12:27 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 28 Jun 2012 14:12:27 -0400 Subject: [torqueusers] job submission torque 4.0.2 In-Reply-To: <4FEC4E46.8000309@univ-reunion.fr> References: <4FEC4E46.8000309@univ-reunion.fr> Message-ID: <4FEC9E8B.4000205@ldeo.columbia.edu> On 06/28/2012 08:29 AM, Delphine Ramalingom wrote: > Dear all, > > I have a question about the syntax to submit serial jobs with torque 4.0.2.
> > I used to write in my script : > #PBS -l select=1:ncpus=1:mem=1gb > but I've noticed that my job saty in queueand checkjob tells me that : > job is deferred. Reason: NoResources > > I tried : > #PBS -l nodes=1:ppn=1,mem=1gb > and my job is running. > > Waat is the good synthax ? > > Thanks a lot. > Delphine > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers Hi Delphine You seem to have answered your own question, the syntax that works is the second one, with 'nodes' and 'ppn'. It works for us here too. My recollection is that the first syntax, which already appeared in questions to this mailing list, is from another job scheduler, not from Torque. I hope this helps, Gus Correa From d-ulrick at comcast.net Thu Jun 28 12:13:24 2012 From: d-ulrick at comcast.net (Dave Ulrick) Date: Thu, 28 Jun 2012 13:13:24 -0500 (CDT) Subject: [torqueusers] How to upgrade 3.x -> 4.x In-Reply-To: <20120628162147.GD13882@fog.msi.umn.edu> References: <1F880D7A2494B346B5AB96481EAE704A1C4B9D@EXMB-03.ad.wsu.edu> <20120628161629.GC13882@fog.msi.umn.edu> <20120628162147.GD13882@fog.msi.umn.edu> Message-ID: On Thu, 28 Jun 2012, Gabe Turner wrote: > On Thu, Jun 28, 2012 at 11:16:29AM -0500, Gabe Turner wrote: >> The only restriction for the Moab license is that it's nodelocked to the >> network node name (output of 'hostname -n') of the system on which the Moab >> server process has to run. If you need to move the Moab server from one >> host to another, you may need to have Adaptive generate a new license file >> based on the name of the new server host. >> > > Sorry, not 'hostname -n', but 'uname -n'. In that case I'd go with the "same node, different ports" approach for running dual Moab services. 
Thanks, Dave -- Dave Ulrick d-ulrick at comcast.net From delphine.ramalingom at univ-reunion.fr Fri Jun 29 05:30:35 2012 From: delphine.ramalingom at univ-reunion.fr (Delphine Ramalingom) Date: Fri, 29 Jun 2012 15:30:35 +0400 Subject: [torqueusers] torque 4.0.2 In-Reply-To: <4FDF02B2.4070602@univ-reunion.fr> References: <4FD83257.4040706@univ-reunion.fr> <4FDB3106.5000900@univ-reunion.fr> <4FDB98C6.3060001@ldeo.columbia.edu> <4FDF02B2.4070602@univ-reunion.fr> Message-ID: <4FED91DB.9020305@univ-reunion.fr> Hi, I've solved the problem: I installed Maui. Thanks. Delphine On 18/06/12 14:28, Delphine Ramalingom wrote: > Thanks for your suggestions. > > I think the problem is that I'm on a workstation, a unique server for > three daemons pbs_server, pbs_mom and pbs_sched. > > Delphine > > On 16/06/12 00:19, Gus Correa wrote: >> On 06/15/2012 03:33 PM, Andrus, Brian Contractor wrote: >>> Delphine, >>> >>> Check your queues and ensure they are enabled and started. Eg: >>> qmgr -c 'set queue tiny enabled = True' >>> qmgr -c 'set queue tiny started = True' >>> >>> >>> Also on your jobs that all have the same $PBS_TASKNUM, you need to submit them as array jobs (eg #PBS -t 10) >>> >>> Brian Andrus >>> ITACS/Research Computing >>> Naval Postgraduate School >>> Monterey, California >>> voice: 831-656-6238 >>> >>> >> ... and to enable scheduling: >> >> qmgr -c 'set server scheduling = True' >> >> *** >> >> Can the server name on mom_priv/config be resolved by >> the compute nodes? >> Typically in /etc/hosts, and associated to your cluster >> private subnet. Say: >> >> mom_priv/config: >> $pbsserver headnode >> >> /etc/hosts: >> 192.168.1.1 headnode >> >> *** >> Did you run 'pbsnodes' to see which nodes/moms respond? >> Did you check the server and mom logs for possible error messages? >> Did you check /var/log/messages for errors?
>> >> I hope this helps, >> Gus Correa >> >> >>> -----Original Message----- >>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Delphine Ramalingom >>> Sent: Friday, June 15, 2012 5:57 AM >>> To: Torque Users Mailing List >>> Subject: Re: [torqueusers] torque 4.0.2 >>> >>> Dear David, >>> >>> I've installed torque 4.0.2, but job stay in queue unless I make a qrun as root. >>> I've installed the default pbs_sched. >>> momctl diagnoses that no local jobs detected : that's wrong... >>> >>> Have you got an idea what is the problem ? thanks. >>> >>> # qstat >>> Job id Name User Time Use S Queue >>> ------------------------- ---------------- --------------- -------- - ----- >>> 29.metis ExampleJob dramalin 0 Q batch >>> 32.metis ExampleJob dramalin 0 Q batch >>> >>> >>> # momctl -h metis.univ.run -d 0 >>> >>> Host: metis.univ.run/metis.univ.run Version: 4.0.2 PID: 2807 >>> Server[0]: metis.univ.run (10.90.0.12:15001) >>> Last Msg From Server: 281 seconds (DeleteJob) >>> Last Msg To Server: 41 seconds >>> HomeDirectory: /var/spool/torque/mom_priv >>> MOM active: 1947 seconds >>> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) >>> NOTE: no local jobs detected >>> >>> diagnostics complete >>> >>> # momctl -p 15002 -h metis.univ.run -d 3 >>> ERROR: query[0] 'diag3' failed on metis.univ.run (errno=0 - Success : >>> 0 - Success) >>> >>> delphine >>> >>> >>> On 13/06/12 20:09, David Beer wrote: >>>> Delphine, >>>> >>>> This is an issue that is fixed in subsequent releases of 4.0.0. Please >>>> download 4.0.2: >>>> http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.2.tar.gz >>>> and the problem will be resolved.
>>>> >>>> David >>>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > From ytt515 at yahoo.cn Thu Jun 28 04:09:35 2012 From: ytt515 at yahoo.cn (TingtingYang) Date: Thu, 28 Jun 2012 18:09:35 +0800 (CST) Subject: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored. Message-ID: <1340878175.7173.YahooMailClassic@web92207.mail.cnh.yahoo.com> hi all:? ?I encounter a error when i want to use torque/blcr to chedkpoint my job.? ?I submit a job with qsub -c enabled ./crtest and hold it with qhold job_ID? ?job error file saied: ? ?ERROR: ld.so: object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored. and /var/log/messages saied: ??Jun 28 17:32:09 node8 checkpoint_script: Invoked: /var/spool/torque/mom_priv/blcr_checkpoint_script 24946 55.node8 pbs pbs /var/spool/torque/checkpoint/55.node8.CK ckpt.55.node8.1340875929 15 -?Jun 28 17:32:09 node8 kernel: blcr: Retry request on -CR_ENOSUPPORTJun 28 17:32:09 node8 checkpoint_script: Subcommand (cr_checkpoint --signal 15 --tree 24946 --file ckpt.55.node8.1340875929) failed with rc=52: - Retry request on -CR_ENOSUPPORT Checkpoint failed: support missing from application?Jun 28 17:32:09 node8 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 52? 
It sounds like I did not export the BLCR library in my LD_LIBRARY_PATH, but I did set up my user environment and I can run cr_run, cr_checkpoint and cr_restart on the command line. I added $ENV{LD_LIBRARY_PATH} = "blcr_libpath"; to checkpoint_script and the error still exists. I use blcr-0.8.4 and torque-2.4.16. Can someone help? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120628/4b81e55f/attachment.html From barbarak at arnes.si Fri Jun 29 07:30:48 2012 From: barbarak at arnes.si (Barbara Krasovec) Date: Fri, 29 Jun 2012 15:30:48 +0200 Subject: [torqueusers] invalid credentials torque 2.5.7 Message-ID: <4FEDAE08.8020903@arnes.si> Hello! I am using torque-server-2.5.7-9.el6.x86_64 and munge-0.5.10-1.el6.x86_64. I defined authorized users, e.g.: set server authorized_users = *@hostname When I try to run qstat or qsub from that host (not the same as the torque server), I get: qstat: Invalid credential Afterwards, when I ssh to the torque server machine as that user and then run qstat again, it works, but only for a short time. Any ideas? Thank you, Barbara From peter.ruprecht at Colorado.EDU Fri Jun 29 09:09:52 2012 From: peter.ruprecht at Colorado.EDU (Peter A Ruprecht) Date: Fri, 29 Jun 2012 09:09:52 -0600 Subject: [torqueusers] OpenMPI and version changed to Torque Message-ID: Hi everyone, Currently we're using torque 2.5.11 and would like to migrate to 4.x pretty soon. However, some testing with 4.0.2 has shown that programs linked against a version of OpenMPI (1.4.x) that was compiled with torque 2.5 won't run across more than one node. My guess is that the task manager API has changed between 2.5 and 4.0. Certainly, best practices would suggest recompiling all libraries that depend on torque when the torque version changes. However, a significant number of our users would be very unhappy having to re-test and possibly recompile their codes with a recompiled OpenMPI.
I think that in some cases they are even required to use identical
libraries across a whole suite of runs to guarantee consistency. This
makes it a little tough to ever change the resource manager.

So, getting around to my questions, is it likely that I am understanding
the dependency between torque, the task manager, and OpenMPI correctly?
And if so, is it really going to be necessary to recompile OpenMPI? What
do you all do in this situation? Is it a bad idea to run torque (on a big
cluster, ~1400 nodes and >10000 jobs/day) without using the task manager?

Any commentary or pointers to relevant documentation appreciated!

Pete Ruprecht

From knielson at adaptivecomputing.com Fri Jun 29 09:50:00 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 29 Jun 2012 09:50:00 -0600
Subject: [torqueusers] OpenMPI and version changed to Torque
In-Reply-To:
References:
Message-ID:

On Fri, Jun 29, 2012 at 9:09 AM, Peter A Ruprecht <peter.ruprecht at colorado.edu> wrote:

> Hi everyone,
>
> Currently we're using torque 2.5.11 and would like to migrate to 4.x
> pretty soon. However, some testing with 4.0.2 has shown that programs
> linked against a version of OpenMPI (1.4.x) that was compiled with torque
> 2.5 won't run across more than one node. My guess is that the task
> manager API has changed between 2.5 and 4.0.

The API did not change. But we did require newer versions of autotools to
configure. I wonder if this could be affecting things.

Ken Nielson
Adaptive Computing

From sm4082 at nyu.edu Fri Jun 29 10:02:18 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Fri, 29 Jun 2012 12:02:18 -0400
Subject: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
In-Reply-To: <1340878175.7173.YahooMailClassic@web92207.mail.cnh.yahoo.com>
References: <1340878175.7173.YahooMailClassic@web92207.mail.cnh.yahoo.com>
Message-ID: <25250AA9-F055-4369-AED8-7DC883F6282F@nyu.edu>

Just making sure: You have installed BLCR on all the nodes. Right?

Sreedhar.

From dbeer at adaptivecomputing.com Fri Jun 29 10:12:33 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 29 Jun 2012 10:12:33 -0600
Subject: [torqueusers] OpenMPI and version changed to Torque
In-Reply-To:
References:
Message-ID:

Peter,

I am under the impression that the different sites running 4.x (either on
test or production systems) haven't had to recompile their version of MPI.
It'd be nice to hear input from different admins on this subject, but my
impression is that this isn't necessary, and I know that we didn't change
the tm interface. I will respond to some of your other questions below.

On Fri, Jun 29, 2012 at 9:09 AM, Peter A Ruprecht <peter.ruprecht at colorado.edu> wrote:

> Hi everyone,
>
> Currently we're using torque 2.5.11 and would like to migrate to 4.x
> pretty soon. However, some testing with 4.0.2 has shown that programs
> linked against a version of OpenMPI (1.4.x) that was compiled with torque
> 2.5 won't run across more than one node. My guess is that the task
> manager API has changed between 2.5 and 4.0.
>
> Certainly, best practices would suggest recompiling all libraries that
> depend on torque when the torque version changes. However, a significant
> number of our users would be very unhappy having to re-test and possibly
> recompile their codes with a recompiled OpenMPI. I think that in some
> cases they are even required to use identical libraries across a whole
> suite of runs to guarantee consistency. This makes it a little tough to
> ever change the resource manager.
>
> So, getting around to my questions, is it likely that I am understanding
> the dependency between torque, the task manager, and OpenMPI correctly?

My two cents: it seems extremely unlikely that if you recompile your MPI
version it would change the results of the job, especially if you recompile
the same version of MPI. In the event that you have to recompile, it seems
like overkill to make everyone re-test their applications. However, I'm by
no means an expert in being an admin for HPC systems (I am a TORQUE
developer) so hopefully some more in the community can weigh in.

> And if so, is it really going to be necessary to recompile OpenMPI? What
> do you all do in this situation? Is it a bad idea to run torque (on a big
> cluster, ~1400 nodes and >10000 jobs/day) without using the task manager?

There are a lot of sites that use (at least occasionally) versions of MPI
that don't interface with TORQUE, or haven't been built to interface with
TORQUE. The most common complaint I've heard from this is that sometimes
they have stray processes left from jobs that don't get cleaned up by the
mom because the mom isn't told when they are launched. Others may have more
input here.

> Any commentary or pointers to relevant documentation appreciated!
>
> Pete Ruprecht

--
David Beer | Software Engineer
Adaptive Computing

From akohlmey at cmm.chem.upenn.edu Fri Jun 29 10:54:27 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Fri, 29 Jun 2012 12:54:27 -0400
Subject: [torqueusers] OpenMPI and version changed to Torque
In-Reply-To:
References:
Message-ID:

On Fri, Jun 29, 2012 at 12:12 PM, David Beer wrote:
> Peter,
>
> I am under the impression that the different sites running 4.x (either on
> test or production systems) haven't had to recompile their version of MPI.
> It'd be nice to hear input from different admins on this subject, but my
> impression is that this isn't necessary, and I know that we didn't change
> the tm interface. I will respond to some of your other questions below.

you did change the soname of libtorque.so, right?

this is likely what keeps OpenMPI from working,
since the corresponding plugin won't load anymore.
here is the ldd output from an OpenMPI 1.4.x installation
on a Torque 2.5.5 machine:

[akohlmey at login2 openmpi]$ ldd mca_ras_tm.so
        linux-vdso.so.1 => (0x00002aaaaaacb000)
        libtorque.so.2 => /opt/torque-2.5.5/lib/libtorque.so.2 (0x00002aaaaaed0000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaab1f1000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaab40a000)
        libm.so.6 => /lib64/libm.so.6 (0x00002aaaab60d000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab892000)
        libc.so.6 => /lib64/libc.so.6 (0x00002aaaabaae000)
        /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

if there are no changes in the ABI (of what OpenMPI uses),
the workaround for keeping OpenMPI happy and working
may be as simple as doing a symlink from libtorque.so.4
to libtorque.so.2.

alternatively, i would try to recompile/relink only the one plugin
(mca_ras_tm.so) and replace it in the OpenMPI installation.

OpenMPI has a very modular structure and none of the application
binaries reference the plugin dependencies unless OpenMPI was
compiled for static linkage only.

HTH,
    axel.

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
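
The plugin check described in the message above could be scripted roughly as follows. This is only a minimal sketch, assuming a POSIX shell with awk available; the function name unresolved_libs and the plugin path in the usage comment are hypothetical, and the only behaviour relied on is that ldd marks an unresolved dependency with "=> not found".

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE or OpenMPI): read `ldd` output
# on stdin and print the name of every shared library that the dynamic
# linker could not resolve.
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example usage on a compute node. The plugin path is an assumption and
# will differ between installations:
#   ldd /opt/openmpi/lib/openmpi/mca_plm_tm.so | unresolved_libs
```

If a libtorque soname shows up in the output, the TORQUE-aware plugin cannot be loaded on that node, which would be consistent with the failure mode suggested above. An empty result does not prove that the plugin works, only that its dependencies resolve.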

From peter.ruprecht at Colorado.EDU Fri Jun 29 11:07:30 2012
From: peter.ruprecht at Colorado.EDU (Peter A Ruprecht)
Date: Fri, 29 Jun 2012 11:07:30 -0600
Subject: [torqueusers] OpenMPI and version changed to Torque
In-Reply-To:
Message-ID:

On 6/29/12 10:54 AM, "Axel Kohlmeyer" wrote:

>if there are no changes in the ABI (of what OpenMPI uses),
>the workaround for keeping OpenMPI happy and working
>may be as simple as doing a symlink from libtorque.so.4
>to libtorque.so.2.
>
>alternatively, i would try to recompile/relink only the one plugin
>(mca_ras_tm.so) and replace it in the OpenMPI installation.

Thanks Axel, great suggestions. The MPI development experts here assured
me that the ldd output looked correct when they were testing, but I'll
double-check that.

Since the t-m interface hasn't changed (thanks for the quick responses, Ken
and David), I'll start looking for possible installation-related problems.

-Pete

From siegert at sfu.ca Fri Jun 29 11:18:40 2012
From: siegert at sfu.ca (Martin Siegert)
Date: Fri, 29 Jun 2012 10:18:40 -0700
Subject: [torqueusers] OpenMPI and version changed to Torque
In-Reply-To:
References:
Message-ID: <20120629171840.GA16898@stikine.sfu.ca>

Hi David, Peter,

I can confirm Peter's observation - see my email to torquedev from May 28.
It is necessary to recompile OpenMPI's mca_plm_tm.so (which does make
upgrading difficult).

Cheers,
Martin

--
Martin Siegert
Simon Fraser University
Burnaby, British Columbia
Canada

From mej at lbl.gov Fri Jun 29 12:09:10 2012
From: mej at lbl.gov (Michael Jennings)
Date: Fri, 29 Jun 2012 11:09:10 -0700
Subject: [torqueusers] OpenMPI and version changed to Torque
In-Reply-To:
Message-ID: <20120629180909.GN6259@lbl.gov>

On Friday, 29 June 2012, at 09:50:00 (-0600), Ken Nielson wrote:

> The API did not change. But we did require newer versions of
> autotools to configure. I wonder if this could be affecting things.

No, that's not possible. The autotools versions are only relevant to
the process of doing the build. They have no bearing on the resultant
ABI.

On Friday, 29 June 2012, at 12:54:27 (-0400), Axel Kohlmeyer wrote:

> you did change the soname of libtorque.so, right?
>
> this is likely what keeps OpenMPI from working,
> since the corresponding plugin won't load anymore.
> here is the ldd output from an OpenMPI 1.4.x installation
> on a Torque 2.5.5 machine:
>
> [akohlmey at login2 openmpi]$ ldd mca_ras_tm.so
>         linux-vdso.so.1 => (0x00002aaaaaacb000)
>         libtorque.so.2 => /opt/torque-2.5.5/lib/libtorque.so.2 (0x00002aaaaaed0000)
>         libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaab1f1000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaab40a000)
>         libm.so.6 => /lib64/libm.so.6 (0x00002aaaab60d000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab892000)
>         libc.so.6 => /lib64/libc.so.6 (0x00002aaaabaae000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
>
> if there are no changes in the ABI (of what OpenMPI uses),
> the workaround for keeping OpenMPI happy and working
> may be as simple as doing a symlink from libtorque.so.4
> to libtorque.so.2.

torque 2.5.x and 4.0.x both seem to supply libtorque.so.2 as far as I
can tell. Are you seeing something different?
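
One way to compare the soname that two TORQUE builds actually record is to read the SONAME entry from the library's dynamic section. The following is a minimal sketch, assuming binutils' readelf is installed; the function name soname_from_readelf and the library path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `readelf -d` output on stdin and print the
# SONAME recorded in the library's dynamic section, if there is one.
soname_from_readelf() {
    sed -n 's/.*(SONAME).*\[\(.*\)].*/\1/p'
}

# Example usage. The library path is an assumption and depends on where
# TORQUE was installed:
#   readelf -d /usr/local/lib/libtorque.so | soname_from_readelf
```

Running this against the libtorque from each TORQUE version shows directly whether the soname differs between them, independently of the package version number.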

It's worth pointing out that shared object versioning is *radically*
different from package versioning, and moving from libtorque.so.2 to
libtorque.so.4 merely to keep the version numbers in sync would be
unwise. But it doesn't appear that's what happened, unless I'm
missing something.

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From ytt515 at yahoo.cn Sat Jun 30 19:20:34 2012
From: ytt515 at yahoo.cn (TingtingYang)
Date: Sun, 1 Jul 2012 09:20:34 +0800 (CST)
Subject: [torqueusers] torque/blcr object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored.
In-Reply-To: <25250AA9-F055-4369-AED8-7DC883F6282F@nyu.edu>
Message-ID: <1341105634.93549.YahooMailClassic@web92203.mail.cnh.yahoo.com>

Thank you for your reply. I only installed blcr and torque on one node. Is that OK?

tingting.yang
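
Since the error is about ld.so not finding libcr_run.so.0, one thing that can be checked from inside a job on the execution node is whether the library is visible in the search path the job actually sees. The following is a minimal sketch under that assumption; find_lib_in_path is a hypothetical helper, and it only inspects the directories passed to it, not the ldconfig cache or the default system directories that ld.so also searches.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory in a colon-separated
# search path that contains the named library. Returns 1 if none does.
find_lib_in_path() {
    lib=$1
    path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $path; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Example usage from within a job script on the compute node:
#   find_lib_in_path libcr_run.so.0 "$LD_LIBRARY_PATH" || echo "not in LD_LIBRARY_PATH"
```

A miss here is not conclusive, because the library may still be found through the system linker configuration, but it does show whether the environment inside the job differs from the interactive shell where cr_run works.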