[torqueusers] torqueusers Digest, Vol 88, Issue 26

RB. Ezhilalan (Principal Physicist, CUH) RB.Ezhilalan at hse.ie
Mon Nov 28 02:11:37 MST 2011


Hi Ken,

Thanks for your suggestion. For now I set ...ncpus=2, and the job is now
able to run on the two PCs. However, I'll try the setting you suggested
when we set up a small cluster using a number of PCs.
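
For reference, a minimal sketch of the two styles of request (assuming two
single-core nodes listed as linux-01 and linux-02 in server_priv/nodes; how
each form is placed depends on the scheduler in use):

  # what the ...ncpus=2 setting corresponds to (placement is scheduler-dependent)
  qsub -l ncpus=2 job.pbs

  # explicitly ask for one processor on each of two distinct nodes
  qsub -l nodes=2:ppn=1 job.pbs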

Regards,
Ezhilalan

Ezhilalan Ramalingam M.Sc.,DABR.,
Principal Physicist (Radiotherapy),
Medical Physics Department,
Cork University Hospital,
Wilton, Cork
Ireland
Tel. 00353 21 4922533
Fax.00353 21 4921300
Email: rb.ezhilalan at hse.ie 
-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of
torqueusers-request at supercluster.org
Sent: 25 November 2011 19:05
To: torqueusers at supercluster.org
Subject: torqueusers Digest, Vol 88, Issue 26

Send torqueusers mailing list submissions to
	torqueusers at supercluster.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://www.supercluster.org/mailman/listinfo/torqueusers
or, via email, send a message with subject or body 'help' to
	torqueusers-request at supercluster.org

You can reach the person managing the list at
	torqueusers-owner at supercluster.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of torqueusers digest..."


Today's Topics:

   1. Issue when upgrading torque from 2.5.7 to 3.0.3
      (Fabien Archambault)
   2. Re: torqueusers Digest, Vol 88, Issue 16 (Ken Nielson)


----------------------------------------------------------------------

Message: 1
Date: Thu, 24 Nov 2011 10:39:30 +0100
From: Fabien Archambault <fabien.archambault at univ-provence.fr>
Subject: [torqueusers] Issue when upgrading torque from 2.5.7 to 3.0.3
To: torqueusers at supercluster.org
Message-ID: <4ECE10D2.7070106 at univ-provence.fr>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Dear torque list,

Yesterday I tried to update a torque installation from 2.5.7 to 3.0.3 in
order, at minimum, to activate cpuset. I compiled torque on the master
with the same options as before (with --enable-cpuset) and the same on a
node (different architecture from the master). I also pushed all
packages (torque-package-clients-linux-x86_64.sh,
torque-package-devel-linux-x86_64.sh,
torque-package-doc-linux-x86_64.sh, torque-package-mom-linux-x86_64.sh)
to the nodes.

Then I backed up my configuration and prayed for a successful update...
To perform the update I did the following (on CentOS 5; see the command
sketch after the list):
- set all nodes offline
- stop pbs_server
- stop maui.d (just in case)
- stop pbs_mom on all nodes
- make install on the master
- package --install on all nodes
- start pbs_mom on all nodes
- start maui.d
- start pbs_server
- set all nodes online
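
A rough command-line sketch of that sequence (assuming the init scripts are
named as in the list above and that pbsnodes -o / -c are used to toggle the
offline state; the node names are purely illustrative):

  # on the master
  pbsnodes -o node01 node02           # mark all nodes offline
  /etc/init.d/pbs_server stop
  /etc/init.d/maui.d stop

  # on every node
  /etc/init.d/pbs_mom stop

  # on the master: install the new build
  make install

  # on every node: install the new self-extracting packages
  ./torque-package-mom-linux-x86_64.sh --install
  ./torque-package-clients-linux-x86_64.sh --install
  /etc/init.d/pbs_mom start

  # back on the master
  /etc/init.d/maui.d start
  /etc/init.d/pbs_server start
  pbsnodes -c node01 node02           # clear the offline state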

First thing, all nodes were still offline. I had some messages in
server_logs saying that it received information from version 1 instead
of version 2. I checked, and pbs_server --version on the master and
pbs_mom --version on the nodes both reported 3.0.3.
What does this message mean?

I also had issues, perhaps related, with messages saying it was
impossible to communicate on port 0; it did not go through the right
port. Are there any special directives to add in version 3.x.x for the
communication port?

Seeing that it was not working well, I re-installed 2.5.7...
Do you think it is possible to update torque to 3.x.x without issues?
Did I miss something, or is it better to update to 2.5.9?

Thank you for any reply,
Fabien Archambault


------------------------------

Message: 2
Date: Fri, 25 Nov 2011 12:04:49 -0700 (MST)
From: Ken Nielson <knielson at adaptivecomputing.com>
Subject: Re: [torqueusers] torqueusers Digest, Vol 88, Issue 16
To: Torque Users Mailing List <torqueusers at supercluster.org>
Message-ID: <8f257a9c-64b9-4a66-a529-ed23c70b8eff at mail>
Content-Type: text/plain; charset=utf-8



----- Original Message -----
> From: "RB. Ezhilalan (Principal Physicist, CUH)" <RB.Ezhilalan at hse.ie>
> To: torqueusers at supercluster.org
> Sent: Friday, November 18, 2011 9:44:30 AM
> Subject: Re: [torqueusers] torqueusers Digest, Vol 88, Issue 16
> 
> Jason,
> 
> I had linux-01 np=1, linux-02 np=1 in the nodes file; despite this, the
> job ran on one core (linux-01) only. Then I removed the 'np' option
> from the list under the notion that the system would 'autodetect' the
> cores.
> 
> Ezhilalan
> 
> Ezhilalan Ramalingam M.Sc.,DABR.,
> Principal Physicist (Radiotherapy),
> Medical Physics Department,
> Cork University Hospital,
> Wilton, Cork
> Ireland
> Tel. 00353 21 4922533
> Fax.00353 21 4921300
> Email: rb.ezhilalan at hse.ie

If you set the server parameter auto_node_np=TRUE, TORQUE will
automatically detect core counts.
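
For example, a minimal sketch using qmgr on the server host (run as a
TORQUE manager; node names stay whatever is already in server_priv/nodes):

  qmgr -c 'set server auto_node_np = True'

With that set, pbs_server adjusts each node's np value to the processor
count reported by its pbs_mom, so hand-maintained np= entries are no
longer needed.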

Ken Nielson
Adaptive Computing
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of
> torqueusers-request at supercluster.org
> Sent: 18 November 2011 16:12
> To: torqueusers at supercluster.org
> Subject: torqueusers Digest, Vol 88, Issue 16
> 
> Send torqueusers mailing list submissions to
> 	torqueusers at supercluster.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://www.supercluster.org/mailman/listinfo/torqueusers
> or, via email, send a message with subject or body 'help' to
> 	torqueusers-request at supercluster.org
> 
> You can reach the person managing the list at
> 	torqueusers-owner at supercluster.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of torqueusers digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: Parallel processing for MC code (Jason Bacon)
>    2. Re: procs= not working as documented (Lance Westerhoff)
>    3. Re: procs= not working as documented (Steve Crusan)
>    4. Re: procs= not working as documented (Lance Westerhoff)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Fri, 18 Nov 2011 07:57:23 -0600
> From: Jason Bacon <jwbacon at tds.net>
> Subject: Re: [torqueusers] Parallel processing for MC code
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <4EC66443.3080608 at tds.net>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> 
> I was only wondering if you had "np=2" in the Linux-01 entry, or if
> Torque was configured to autodetect the number of cores and there were
> two.  That would have explained the scheduling behavior.
> 
> Regards,
> 
>      -J
> 
> On 11/18/11 03:48, RB. Ezhilalan (Principal Physicist, CUH) wrote:
> >
> > Hi Jason,
> >
> > PC1 (linux-01) is a single-core PC like PC2. I defined the
> > server_priv/nodes file as:
> >
> > Linux-01
> >
> > Linux-02
> >
> > As you have mentioned, maybe the resource requirements need to be
> > properly set up. Do you have any suggestions?
> >
> > Many thanks,
> >
> > Ezhilalan
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org
> > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of
> > torqueusers-request at supercluster.org
> > Sent: 17 November 2011 17:20
> > To: torqueusers at supercluster.org
> > Subject: torqueusers Digest, Vol 88, Issue 14
> >
> > Send torqueusers mailing list submissions to
> >
> > torqueusers at supercluster.org
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> >
> >       http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > or, via email, send a message with subject or body 'help' to
> >
> >       torqueusers-request at supercluster.org
> >
> > You can reach the person managing the list at
> >
> >       torqueusers-owner at supercluster.org
> >
> > When replying, please edit your Subject line so it is more specific
> >
> > than "Re: Contents of torqueusers digest..."
> >
> > Today's Topics:
> >
> >    1. Re: Random SCP errors when transfering to/from  CREAM sandbox
> >
> >       (Christopher Samuel)
> >
> >    2. Re: Random SCP errors when transfering to/from  CREAM sandbox
> >
> >       (Gila Arrondo  Miguel Angel)
> >
> >    3. Parallel processing for MC code
> >
> >       (RB. Ezhilalan (Principal Physicist, CUH))
> >
> >    4. Re: Parallel processing for MC code (Jason Bacon)
> >
> >    5. Re: File staging syntax (Steve Traylen)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> >
> > Date: Thu, 17 Nov 2011 13:29:44 +1100
> >
> > From: Christopher Samuel <samuel at unimelb.edu.au>
> >
> > Subject: Re: [torqueusers] Random SCP errors when transfering
> > to/from
> >
> >       CREAM sandbox
> >
> > To: torqueusers at supercluster.org
> >
> > Message-ID: <4EC47198.1040709 at unimelb.edu.au>
> >
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> >
> > Hash: SHA1
> >
> > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote:
> >
> > > Many thanks for your answer. We've made sure that the
> >
> > > keys are okay, as well as disabling host key checking to
> >
> > > test it.
> >
> > Can you try and scp as that user to see whether it
> >
> > complains about anything else ?
> >
> > It may be that it is prompting the user to accept a
> >
> > host key if they don't already have it.
> >
> > cheers,
> >
> > Chris
> >
> > - --
> >
> >     Christopher Samuel - Senior Systems Administrator
> >
> >  VLSCI - Victorian Life Sciences Computation Initiative
> >
> >  Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> >
> >          http://www.vlsci.unimelb.edu.au/
> >
> > -----BEGIN PGP SIGNATURE-----
> >
> > Version: GnuPG v1.4.11 (GNU/Linux)
> >
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> >
> > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW
> >
> > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS
> >
> > =VPqK
> >
> > -----END PGP SIGNATURE-----
> >
> > ------------------------------
> >
> > Message: 2
> >
> > Date: Thu, 17 Nov 2011 07:55:50 +0000
> >
> > From: "Gila Arrondo  Miguel Angel" <miguel.gila at cscs.ch>
> >
> > Subject: Re: [torqueusers] Random SCP errors when transfering
> > to/from
> >
> >       CREAM sandbox
> >
> > To: Torque Users Mailing List <torqueusers at supercluster.org>
> >
> > Message-ID: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE at cscs.ch>
> >
> > Content-Type: text/plain; charset="us-ascii"
> >
> > Hi Chris,
> >
> > I've done that in many WNs and with different users, so I don't
> > think
> > that is be the issue. I've also checked for scheduled tasks that
> > interact with the ssh keys, but the errors happen at random times,
> > not
> 
> > when the scheduled tasks run... :-S
> >
> > I'm running out of options here.
> >
> > Cheers,
> >
> > Miguel
> >
> > On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote:
> >
> > > -----BEGIN PGP SIGNED MESSAGE-----
> >
> > > Hash: SHA1
> >
> > >
> >
> > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote:
> >
> > >
> >
> > >> Many thanks for your answer. We've made sure that the
> >
> > >> keys are okay, as well as disabling host key checking to
> >
> > >> test it.
> >
> > >
> >
> > > Can you try and scp as that user to see whether it
> >
> > > complains about anything else ?
> >
> > >
> >
> > > It may be that it is prompting the user to accept a
> >
> > > host key if they don't already have it.
> >
> > >
> >
> > > cheers,
> >
> > > Chris
> >
> > > - --
> >
> > >    Christopher Samuel - Senior Systems Administrator
> >
> > > VLSCI - Victorian Life Sciences Computation Initiative
> >
> > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> >
> > >         http://www.vlsci.unimelb.edu.au/
> >
> > >
> >
> > > -----BEGIN PGP SIGNATURE-----
> >
> > > Version: GnuPG v1.4.11 (GNU/Linux)
> >
> > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> >
> > >
> >
> > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW
> >
> > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS
> >
> > > =VPqK
> >
> > > -----END PGP SIGNATURE-----
> >
> > > _______________________________________________
> >
> > > torqueusers mailing list
> >
> > > torqueusers at supercluster.org
> >
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > --
> >
> > Miguel Gila
> >
> > CSCS Swiss National Supercomputing Centre
> >
> > HPC Solutions
> >
> > Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
> >
> > miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22
> >
> > -------------- next part --------------
> >
> > A non-text attachment was scrubbed...
> >
> > Name: smime.p7s
> >
> > Type: application/pkcs7-signature
> >
> > Size: 3239 bytes
> >
> > Desc: not available
> >
> > Url :
> >
> > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/214ea9d6/attachment-0001.bin
> >
> > ------------------------------
> >
> > Message: 3
> >
> > Date: Thu, 17 Nov 2011 10:14:32 -0000
> >
> > From: "RB. Ezhilalan (Principal Physicist, CUH)"
> > <RB.Ezhilalan at hse.ie>
> >
> > Subject: [torqueusers] Parallel processing for MC code
> >
> > To: torqueusers at supercluster.org
> >
> > Message-ID:
> >
> >
> > <DB0960F9D7310D4BA87B4985061511A703B30D96 at CKVEX004.south.health.local>
> >
> > Content-Type: text/plain; charset="us-ascii"
> >
> > Hi All,
> >
> > I've been trying to set up Torque queuing system on two SUSE10.1
> > linux
> >
> > PCs (PIII!).
> >
> > Installed the linux on both PCs, exported home directory containing
> >
> > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH
> > passwordless communication. All seems to be working fine.
> >
> > Downloaded latest version of Torque (number not handy) installed
> >
> > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2.
> >
> > PBS 'nodes' file was created as per guidelines, PBS_SERVER and
> > QUEUE
> >
> > attributes were set as default.
> >
> > Pbsnodes -a command displays two nodes (PC1 & PC2) and they are free.
> > I am not sure whether this confirms PBS/Torque is set up correctly.
> >
> > I was able to run an executable BEAMnrc user code in batch mode, i.e.
> > using the 'exb' command, which is aliased to 'qsub' and sources a
> > built-in job script file, with option p=1 (single job).
> >
> > To split the job in two, so that it runs in parallel on the two PCs,
> > option p=2 should be issued. However, what I noticed was that the job
> > ran twice on the first PC (PC1) and not on both.
> >
> > I can't figure out what went wrong; I suspect the PBS setup could have
> > some issues. Maybe I can try running the job specifically on PC2; if
> > so, what command do I need to give?
> >
> > I would be grateful for any advice!
> >
> > Kind Regards,
> >
> > Ezhilalan
> >
> > -------------- next part --------------
> >
> > An HTML attachment was scrubbed...
> >
> > URL:
> >
> > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/06e4a798/attachment-0001.html
> >
> >
> > ------------------------------
> >
> > Message: 4
> >
> > Date: Thu, 17 Nov 2011 10:18:18 -0600
> >
> > From: Jason Bacon <jwbacon at tds.net>
> >
> > Subject: Re: [torqueusers] Parallel processing for MC code
> >
> > To: Torque Users Mailing List <torqueusers at supercluster.org>
> >
> > Message-ID: <4EC533CA.2000902 at tds.net>
> >
> > Content-Type: text/plain; charset=windows-1252; format=flowed
> >
> > How many cores does PC1 have? Note that Torque schedules cores, not
> >
> > computers, unless you specifically tell it to with resource
> requirements.
> >
> > Regards,
> >
> > -J
> >
> > On 11/17/11 04:14, RB. Ezhilalan (Principal Physicist, CUH) wrote:
> >
> > >
> >
> > > Hi All,
> >
> > >
> >
> > > I've been trying to set up Torque queuing system on two SUSE10.1
> > > linux PCs (PIII!).
> >
> > >
> >
> > > Installed the linux on both PCs, exported home directory
> > > containing
> >
> > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH
> >
> > > passwordless communication. All seems to be working fine.
> >
> > >
> >
> > > Downloaded latest version of Torque (number not handy) installed
> >
> > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2.
> >
> > >
> >
> > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and
> > > QUEUE
> >
> > > attributes were set as default.
> >
> > >
> >
> > > Pbsnodes -a command displays two nodes (PC1 & PC2) and they are
> > > free. I am not sure whether this confirms PBS/Torque is set up
> > > correctly.
> >
> > >
> >
> > > I was able to run an executable BEAMnrc user code in batch mode,
> > > i.e. using the 'exb' command, which is aliased to 'qsub' and sources
> > > a built-in job script file, with option p=1 (single job).
> >
> > >
> >
> > > To split the job in two, so that it runs in parallel on the two
> > > PCs, option p=2 should be issued. However, what I noticed was that
> > > the job ran twice on the first PC (PC1) and not on both.
> >
> > >
> >
> > > I can't figure out what went wrong; I suspect the PBS setup could
> > > have some issues. Maybe I can try running the job specifically on
> > > PC2; if so, what command do I need to give?
> >
> > >
> >
> > > I would be grateful for any advice!
> >
> > >
> >
> > > Kind Regards,
> >
> > >
> >
> > > Ezhilalan
> >
> > >
> >
> > >
> >
> > > _______________________________________________
> >
> > > torqueusers mailing list
> >
> > > torqueusers at supercluster.org
> >
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > --
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > Jason W. Bacon
> >
> > jwbacon at tds.net
> >
> > http://personalpages.tds.net/~jwbacon
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > ------------------------------
> >
> > Message: 5
> >
> > Date: Thu, 17 Nov 2011 18:19:14 +0100
> >
> > From: Steve Traylen <steve.traylen at cern.ch>
> >
> > Subject: Re: [torqueusers] File staging syntax
> >
> > To: Torque Users Mailing List <torqueusers at supercluster.org>
> >
> > Message-ID:
> >
> > <CAOXEVSCY2CC-=ajKvcc6PAgKd5S6fupRkgNp79KL_w3=k2Xy1A at mail.gmail.com>
> >
> > Keywords: CERN SpamKiller Note: -50
> >
> > Content-Type: text/plain; charset="ISO-8859-1"
> >
> > On Thu, Sep 29, 2011 at 4:59 PM, Ken Nielson
> >
> > <knielson at adaptivecomputing.com> wrote>
> >
> > > Andr?,
> >
> > >
> >
> > > I have not yet had time to reproduce this. I did look through the
> > change log and there are two suspects. One is in 2.5.6, a fix for
> > Bugzilla 115 and the other is in 2.5.8, a fix for Bugzilla 133.
> >
> > >
> >
> > > That is as far as I am right now. I will try to get to this as
> > > soon
> > as I can.
> >
> > Hi Ken,
> >
> >  Did you manage to track this down. It's currently making upgrading
> >  a
> > pain.
> >
> > Steve.
> >
> > --
> >
> > Steve Traylen
> >
> > ------------------------------
> >
> > _______________________________________________
> >
> > torqueusers mailing list
> >
> > torqueusers at supercluster.org
> >
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > End of torqueusers Digest, Vol 88, Issue 14
> >
> > *******************************************
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Jason W. Bacon
> jwbacon at tds.net
> http://personalpages.tds.net/~jwbacon
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Fri, 18 Nov 2011 09:33:12 -0500
> From: Lance Westerhoff <lance at quantumbioinc.com>
> Subject: Re: [torqueusers] procs= not working as documented
> To: torqueusers at supercluster.org
> Message-ID: <CCDE8276-0991-41D7-A719-239F7CB1C666 at quantumbioinc.com>
> Content-Type: text/plain; charset=us-ascii
> 
> The request that is placed is for procs=60. Both torque and maui see
> that there are only 53 processors available and instead of letting
> the
> job sit in the queue and wait for all 60 processors to become
> available,
> it goes ahead and runs the job with what's available. Now if the user
> could ask for procs=[50-60] where 50 is the minimum number of
> processors
> to provide and 60 is the maximum, this would be a feature. But as it
> stands, if the user asks for 60 processors and ends up with 2
> processors, the job just won't scale properly and he may as well kill
> it
> (when it shouldn't have run anyway).
> 
> I'm actually beginning to think the problem may be related to maui.
> Perhaps I'll post this same question to the maui list and see what
> comes
> back.
> 
> This problem is infuriating though since without the functionality
> working as it should, using procs=X in torque/maui makes torque/maui
> work more like a submission and run system (not a queuing system).
> 
> -Lance
> 
> 
> > 
> > Message: 3
> > Date: Thu, 17 Nov 2011 17:29:17 -0800
> > From: "Brock Palen" <brockp at umich.edu>
> > Subject: Re: [torqueusers] procs= not working as documented
> > To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> > Message-ID:
> > <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
> > Content-Type: text/plain; charset="utf-8"
> > 
> > Does maui only see one cpu or does mpiexec only see one cpu?
> > 
> > 
> > 
> > Brock Palen
> > (734)936-1985
> > brockp at umich.edu
> > - Sent from my Palm Pre, please excuse typos
> > On Nov 17, 2011 3:19 PM, Lance Westerhoff
> &lt;lance at quantumbioinc.com&gt; wrote:
> > 
> > 
> > 
> > Hello All-
> > 
> > 
> > 
> > It appears that when running with the following specs, the procs=
> option does not actually work as expected.
> > 
> > 
> > 
> > ==========================================
> > 
> > 
> > 
> > #PBS -S /bin/bash
> > 
> > #PBS -l procs=60
> > 
> > #PBS -l pmem=700mb
> > 
> > #PBS -l walltime=744:00:00
> > 
> > #PBS -j oe
> > 
> > #PBS -q batch
> > 
> > 
> > 
> > torque version: tried 3.0.2. in v2.5.4, I think the procs option
> worked as documented
> > 
> > maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete
> fail in terms of the procs option and it only asks for a single CPU)
> > 
> > 
> > 
> > ==========================================
> > 
> > 
> > 
> > If there are fewer than 60 processors available in the cluster (in
> > this case there were 53 available) the job will go in and take
> > whatever is left instead of waiting for all 60 processors to free up.
> > Any thoughts as to why this might be happening? Sometimes it doesn't
> > really matter and 53 would be almost as good as 60; however, if only 2
> > processors are available and the user asks for 60, I would hate for
> > him to go in.
> > 
> > 
> > 
> > Thank you for your time!
> > 
> > 
> > 
> > -Lance
> > 
> > 
> > 
> > 
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Fri, 18 Nov 2011 09:47:24 -0500
> From: Steve Crusan <scrusan at ur.rochester.edu>
> Subject: Re: [torqueusers] procs= not working as documented
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <B2DF69B9-AEB2-4972-8936-EE2F528D07D5 at ur.rochester.edu>
> Content-Type: text/plain; charset=us-ascii
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
> 
> > The request that is placed is for procs=60. Both torque and maui
> > see
> that there are only 53 processors available and instead of letting
> the
> job sit in the queue and wait for all 60 processors to become
> available,
> it goes ahead and runs the job with what's available. Now if the user
> could ask for procs=[50-60] where 50 is the minimum number of
> processors
> to provide and 60 is the maximum, this would be a feature. But as it
> stands, if the user asks for 60 processors and ends up with 2
> processors, the job just won't scale properly and he may as well kill
> it
> (when it shouldn't have run anyway).
> 
> Hi Lance,
> 
> 	Can you post the output of checkjob <jobid> for an incorrectly
> running job? Let's take a look at what Maui thinks the job is asking
> for.
> 	
> 	Might as well add your maui.cfg file also.
> 
> 	I've found in the past that procs= is troublesome...
> 
> > 
> > I'm actually beginning to think the problem may be related to maui.
> Perhaps I'll post this same question to the maui list and see what
> comes
> back.
> > 
> > This problem is infuriating though since without the functionality
> working as it should, using procs=X in torque/maui makes torque/maui
> work more like a submission and run system (not a queuing system).
> 
> Agreed. HPC cluster job management is normally set it and forget it.
> Anything other than maintenance/break fixes/new features would be
> ridiculously time consuming.
> 
> > 
> > -Lance
> > 
> > 
> >> 
> >> Message: 3
> >> Date: Thu, 17 Nov 2011 17:29:17 -0800
> >> From: "Brock Palen" <brockp at umich.edu>
> >> Subject: Re: [torqueusers] procs= not working as documented
> >> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >> Message-ID:
> >> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
> >> Content-Type: text/plain; charset="utf-8"
> >> 
> >> Does maui only see one cpu or does mpiexec only see one cpu?
> >> 
> >> 
> >> 
> >> Brock Palen
> >> (734)936-1985
> >> brockp at umich.edu
> >> - Sent from my Palm Pre, please excuse typos
> >> On Nov 17, 2011 3:19 PM, Lance Westerhoff
> &lt;lance at quantumbioinc.com&gt; wrote:
> >> 
> >> 
> >> 
> >> Hello All-
> >> 
> >> 
> >> 
> >> It appears that when running with the following specs, the procs=
> option does not actually work as expected.
> >> 
> >> 
> >> 
> >> ==========================================
> >> 
> >> 
> >> 
> >> #PBS -S /bin/bash
> >> 
> >> #PBS -l procs=60
> >> 
> >> #PBS -l pmem=700mb
> >> 
> >> #PBS -l walltime=744:00:00
> >> 
> >> #PBS -j oe
> >> 
> >> #PBS -q batch
> >> 
> >> 
> >> 
> >> torque version: tried 3.0.2. in v2.5.4, I think the procs option
> worked as documented
> >> 
> >> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete
> fail in terms of the procs option and it only asks for a single CPU)
> >> 
> >> 
> >> 
> >> ==========================================
> >> 
> >> 
> >> 
> >> If there are fewer than 60 processors available in the cluster (in
> >> this case there were 53 available) the job will go in and take
> >> whatever is left instead of waiting for all 60 processors to free up.
> >> Any thoughts as to why this might be happening? Sometimes it doesn't
> >> really matter and 53 would be almost as good as 60; however, if only 2
> >> processors are available and the user asks for 60, I would hate for
> >> him to go in.
> >> 
> >> 
> >> 
> >> Thank you for your time!
> >> 
> >> 
> >> 
> >> -Lance
> >> 
> >> 
> >> 
> >> 
> > 
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 
>  ----------------------
>  Steve Crusan
>  System Administrator
>  Center for Research Computing
>  University of Rochester
>  https://www.crc.rochester.edu/
> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> Comment: GPGTools - http://gpgtools.org
> 
> iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/
> bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28
> cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ
> tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8
> JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv
> Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c=
> =AGW7
> -----END PGP SIGNATURE-----
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Fri, 18 Nov 2011 11:12:06 -0500
> From: Lance Westerhoff <lance at quantumbioinc.com>
> Subject: Re: [torqueusers] procs= not working as documented
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE at quantumbioinc.com>
> Content-Type: text/plain; charset=us-ascii
> 
> 
> Hi Steve-
> 
> Here you go. Here are the top few lines of the job script. I have then
> provided the output you requested along with the maui.cfg. If you need
> anything further, please let me know.
> 
> Thanks for your help!
> 
> ===============
> 
>  + head job.pbs
> 
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -l procs=100
> #PBS -l pmem=700mb
> #PBS -l walltime=744:00:00
> #PBS -j oe
> #PBS -q batch
> 
> Report run on Fri Nov 18 10:49:38 EST 2011
> + pbsnodes --version
> version: 3.0.2
> + diagnose --version
> maui client version 3.2.6p21
> + checkjob 371010
> 
> 
> checking job 371010
> 
> State: Running
> Creds:  user:josh  group:games  class:batch  qos:DEFAULT
> WallTime: 00:02:35 of 31:00:00:00
> SubmitTime: Fri Nov 18 10:46:33
>   (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
> 
> StartTime: Fri Nov 18 10:46:34
> Total Tasks: 1
> 
> Req[0]  TaskCount: 26  Partition: DEFAULT
> Network: [NONE]  Memory >= 700M  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 700M
> NodeCount: 10
> Allocated Nodes:
> [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
> [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
> [compute-0-13:2][compute-0-14:2]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> Reservation '371010' (-00:02:09 -> 30:23:57:51  Duration:
> 31:00:00:00)
> PE:  26.00  StartPriority:  4716
> 
> + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"
> SERVERHOST            gondor
> ADMIN1                maui root
> ADMIN3                ALL
> RMCFG[base]  TYPE=PBS
> AMCFG[bank]  TYPE=NONE
> RMPOLLINTERVAL        00:01:00
> SERVERPORT            42559
> SERVERMODE            NORMAL
> LOGFILE               maui.log
> LOGFILEMAXSIZE        10000000
> LOGLEVEL              3
> QUEUETIMEWEIGHT       1
> FSPOLICY              DEDICATEDPS
> FSDEPTH               7
> FSINTERVAL            86400
> FSDECAY               0.50
> FSWEIGHT              200
> FSUSERWEIGHT          1
> FSGROUPWEIGHT         1000
> FSQOSWEIGHT           1000
> FSACCOUNTWEIGHT       1
> FSCLASSWEIGHT         1000
> USERWEIGHT            4
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
> NODEALLOCATIONPOLICY  MINRESOURCE
> RESERVATIONDEPTH            8
> MAXJOBPERUSERPOLICY         OFF
> MAXJOBPERUSERCOUNT          8
> MAXPROCPERUSERPOLICY        OFF
> MAXPROCPERUSERCOUNT         256
> MAXPROCSECONDPERUSERPOLICY  OFF
> MAXPROCSECONDPERUSERCOUNT   36864000
> MAXJOBQUEUEDPERUSERPOLICY   OFF
> MAXJOBQUEUEDPERUSERCOUNT    2
> JOBNODEMATCHPOLICY          EXACTNODE
> NODEACCESSPOLICY            SHARED
> JOBMAXOVERRUN 99:00:00:00
> DEFERCOUNT 8192
> DEFERTIME  0
> CLASSCFG[developer] FSTARGET=40.00+
> CLASSCFG[lowprio] PRIORITY=-1000
> SRCFG[developer] CLASSLIST=developer
> SRCFG[developer] ACCESS=dedicated
> SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri
> SRCFG[developer] STARTTIME=08:00:00
> SRCFG[developer] ENDTIME=18:00:00
> SRCFG[developer] TIMELIMIT=2:00:00
> SRCFG[developer] RESOURCES=PROCS(8)
> USERCFG[DEFAULT]      FSTARGET=100.0
> 
> ===============
> 
> -Lance
> 
> 
> On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote:
> 
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> > 
> > 
> > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
> > 
> >> The request that is placed is for procs=60. Both torque and maui
> >> see
> that there are only 53 processors available and instead of letting
> the
> job sit in the queue and wait for all 60 processors to become
> available,
> it goes ahead and runs the job with what's available. Now if the user
> could ask for procs=[50-60] where 50 is the minimum number of
> processors
> to provide and 60 is the maximum, this would be a feature. But as it
> stands, if the user asks for 60 processors and ends up with 2
> processors, the job just won't scale properly and he may as well kill
> it
> (when it shouldn't have run anyway).
> > 
> > Hi Lance,
> > 
> > 	Can you post the output of checkjob <jobid> for an incorrectly
> > running job? Let's take a look at what Maui thinks the job is asking
> > for.
> > 	
> > 	Might as well add your maui.cfg file also.
> > 
> > 	I've found in the past that procs= is troublesome...
> > 
> >> 
> >> I'm actually beginning to think the problem may be related to
> >> maui.
> Perhaps I'll post this same question to the maui list and see what
> comes
> back.
> >> 
> >> This problem is infuriating though since without the functionality
> working as it should, using procs=X in torque/maui makes torque/maui
> work more like a submission and run system (not a queuing system).
> > 
> > Agreed. HPC cluster job management is normally set it and forget it.
> > Anything other than maintenance/break fixes/new features would be
> > ridiculously time consuming.
> > 
> >> 
> >> -Lance
> >> 
> >> 
> >>> 
> >>> Message: 3
> >>> Date: Thu, 17 Nov 2011 17:29:17 -0800
> >>> From: "Brock Palen" <brockp at umich.edu>
> >>> Subject: Re: [torqueusers] procs= not working as documented
> >>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >>> Message-ID:
> >>> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
> >>> Content-Type: text/plain; charset="utf-8"
> >>> 
> >>> Does maui only see one cpu or does mpiexec only see one cpu?
> >>> 
> >>> 
> >>> 
> >>> Brock Palen
> >>> (734)936-1985
> >>> brockp at umich.edu
> >>> - Sent from my Palm Pre, please excuse typos
> >>> On Nov 17, 2011 3:19 PM, Lance Westerhoff
> &lt;lance at quantumbioinc.com&gt; wrote:
> >>> 
> >>> 
> >>> 
> >>> Hello All-
> >>> 
> >>> 
> >>> 
> >>> It appears that when running with the following specs, the procs=
> option does not actually work as expected.
> >>> 
> >>> 
> >>> 
> >>> ==========================================
> >>> 
> >>> 
> >>> 
> >>> #PBS -S /bin/bash
> >>> 
> >>> #PBS -l procs=60
> >>> 
> >>> #PBS -l pmem=700mb
> >>> 
> >>> #PBS -l walltime=744:00:00
> >>> 
> >>> #PBS -j oe
> >>> 
> >>> #PBS -q batch
> >>> 
> >>> 
> >>> 
> >>> torque version: tried 3.0.2. in v2.5.4, I think the procs option
> worked as documented
> >>> 
> >>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a
> >>> complete
> fail in terms of the procs option and it only asks for a single CPU)
> >>> 
> >>> 
> >>> 
> >>> ==========================================
> >>> 
> >>> 
> >>> 
> >>> If there are fewer than 60 processors available in the cluster (in
> >>> this case there were 53 available) the job will go in and take
> >>> whatever is left instead of waiting for all 60 processors to free up.
> >>> Any thoughts as to why this might be happening? Sometimes it doesn't
> >>> really matter and 53 would be almost as good as 60; however, if only
> >>> 2 processors are available and the user asks for 60, I would hate for
> >>> him to go in.
> >>> 
> >>> 
> >>> 
> >>> Thank you for your time!
> >>> 
> >>> 
> >>> 
> >>> -Lance
> >>> 
> >>> 
> >>> 
> >>> 
> >> 
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> > 
> > ----------------------
> > Steve Crusan
> > System Administrator
> > Center for Research Computing
> > University of Rochester
> > https://www.crc.rochester.edu/
> > 
> > 
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> > Comment: GPGTools - http://gpgtools.org
> > 
> > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/
> > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28
> > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ
> > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8
> > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv
> > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c=
> > =AGW7
> > -----END PGP SIGNATURE-----
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> End of torqueusers Digest, Vol 88, Issue 16
> *******************************************
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 


------------------------------

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


End of torqueusers Digest, Vol 88, Issue 26
*******************************************

