[torqueusers] torqueusers Digest, Vol 88, Issue 16

Ken Nielson knielson at adaptivecomputing.com
Fri Nov 25 12:04:49 MST 2011



----- Original Message -----
> From: "RB. Ezhilalan (Principal Physicist, CUH)" <RB.Ezhilalan at hse.ie>
> To: torqueusers at supercluster.org
> Sent: Friday, November 18, 2011 9:44:30 AM
> Subject: Re: [torqueusers] torqueusers Digest, Vol 88, Issue 16
> 
> Jason,
> 
> I had linux-01 np=1, linux-02 np=1 in the nodes file; despite this, the
> job ran on one core (linux-01) only. I then removed the 'np' option from
> the list on the assumption that the system would 'autodetect' the cores.
> 
> Ezhilalan
> 
> Ezhilalan Ramalingam M.Sc.,DABR.,
> Principal Physicist (Radiotherapy),
> Medical Physics Department,
> Cork University Hospital,
> Wilton, Cork
> Ireland
> Tel. 00353 21 4922533
> Fax.00353 21 4921300
> Email: rb.ezhilalan at hse.ie

If you set the server parameter auto_node_np=TRUE, TORQUE will automatically detect core counts.
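
A minimal sketch of both approaches mentioned in this thread, using the
hostnames from the original post (adjust them to match your own
pbsnodes -a output):

    # Option 1: let pbs_server detect core counts automatically
    qmgr -c "set server auto_node_np = True"

    # Option 2: state the core count per node explicitly in the
    # server_priv/nodes file (one line per host)
    linux-01 np=1
    linux-02 np=1

    # pbs_server reads the nodes file at startup, so restart it after
    # editing, then confirm what the server sees:
    pbsnodes -a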

Ken Nielson
Adaptive Computing
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of
> torqueusers-request at supercluster.org
> Sent: 18 November 2011 16:12
> To: torqueusers at supercluster.org
> Subject: torqueusers Digest, Vol 88, Issue 16
> 
> Send torqueusers mailing list submissions to
> 	torqueusers at supercluster.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://www.supercluster.org/mailman/listinfo/torqueusers
> or, via email, send a message with subject or body 'help' to
> 	torqueusers-request at supercluster.org
> 
> You can reach the person managing the list at
> 	torqueusers-owner at supercluster.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of torqueusers digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: Parallel processing for MC code (Jason Bacon)
>    2. Re: procs= not working as documented (Lance Westerhoff)
>    3. Re: procs= not working as documented (Steve Crusan)
>    4. Re: procs= not working as documented (Lance Westerhoff)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Fri, 18 Nov 2011 07:57:23 -0600
> From: Jason Bacon <jwbacon at tds.net>
> Subject: Re: [torqueusers] Parallel processing for MC code
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <4EC66443.3080608 at tds.net>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> 
> I was only wondering if you had "np=2" in the Linux-01 entry, or if
> Torque was configured to autodetect the number of cores and there
> were
> two.  That would have explained the scheduling behavior.
> 
> Regards,
> 
>      -J
> 
> On 11/18/11 03:48, RB. Ezhilalan (Principal Physicist, CUH) wrote:
> >
> > Hi Jason,
> >
> > PC1 (linux-01) is a single-core PC like PC2. I defined the
> > server_priv/nodes file as:
> >
> > linux-01
> >
> > linux-02
> >
> > As you have mentioned, maybe the resource requirements need to be
> > properly set up. Do you have any suggestions?
> >
> > Many thanks,
> >
> > Ezhilalan
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org
> > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of
> > torqueusers-request at supercluster.org
> > Sent: 17 November 2011 17:20
> > To: torqueusers at supercluster.org
> > Subject: torqueusers Digest, Vol 88, Issue 14
> >
> > Send torqueusers mailing list submissions to
> >
> > torqueusers at supercluster.org
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> >
> >       http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > or, via email, send a message with subject or body 'help' to
> >
> >       torqueusers-request at supercluster.org
> >
> > You can reach the person managing the list at
> >
> >       torqueusers-owner at supercluster.org
> >
> > When replying, please edit your Subject line so it is more specific
> >
> > than "Re: Contents of torqueusers digest..."
> >
> > Today's Topics:
> >
> >    1. Re: Random SCP errors when transfering to/from  CREAM sandbox
> >
> >       (Christopher Samuel)
> >
> >    2. Re: Random SCP errors when transfering to/from  CREAM sandbox
> >
> >       (Gila Arrondo  Miguel Angel)
> >
> >    3. Parallel processing for MC code
> >
> >       (RB. Ezhilalan (Principal Physicist, CUH))
> >
> >    4. Re: Parallel processing for MC code (Jason Bacon)
> >
> >    5. Re: File staging syntax (Steve Traylen)
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> >
> > Date: Thu, 17 Nov 2011 13:29:44 +1100
> >
> > From: Christopher Samuel <samuel at unimelb.edu.au>
> >
> > Subject: Re: [torqueusers] Random SCP errors when transfering
> > to/from
> >
> >       CREAM sandbox
> >
> > To: torqueusers at supercluster.org
> >
> > Message-ID: <4EC47198.1040709 at unimelb.edu.au>
> >
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> >
> > Hash: SHA1
> >
> > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote:
> >
> > > Many thanks for your answer. We've made sure that the
> >
> > > keys are okay, as well as disabling host key checking to
> >
> > > test it.
> >
> > Can you try and scp as that user to see whether it
> >
> > complains about anything else?
> >
> > It may be that it is prompting the user to accept a
> >
> > host key if they don't already have it.
> >
> > cheers,
> >
> > Chris
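
A quick way to reproduce Chris's suggestion by hand, assuming the copy is
done as the job owner (the user and host names below are only
placeholders):

    # Run the same kind of copy the MOM would perform and watch for a
    # host-key prompt or any other interactive output
    su - someuser -c "scp /tmp/testfile headnode:/tmp/"

    # If an unaccepted host key turns out to be the problem, the keys can
    # be pre-accepted, e.g.:
    ssh-keyscan headnode >> ~someuser/.ssh/known_hosts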
> >
> > - --
> >
> >     Christopher Samuel - Senior Systems Administrator
> >
> >  VLSCI - Victorian Life Sciences Computation Initiative
> >
> >  Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> >
> >          http://www.vlsci.unimelb.edu.au/
> >
> > -----BEGIN PGP SIGNATURE-----
> >
> > Version: GnuPG v1.4.11 (GNU/Linux)
> >
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> >
> > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW
> >
> > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS
> >
> > =VPqK
> >
> > -----END PGP SIGNATURE-----
> >
> > ------------------------------
> >
> > Message: 2
> >
> > Date: Thu, 17 Nov 2011 07:55:50 +0000
> >
> > From: "Gila Arrondo  Miguel Angel" <miguel.gila at cscs.ch>
> >
> > Subject: Re: [torqueusers] Random SCP errors when transfering
> > to/from
> >
> >       CREAM sandbox
> >
> > To: Torque Users Mailing List <torqueusers at supercluster.org>
> >
> > Message-ID: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE at cscs.ch>
> >
> > Content-Type: text/plain; charset="us-ascii"
> >
> > Hi Chris,
> >
> > I've done that on many WNs and with different users, so I don't think
> > that is the issue. I've also checked for scheduled tasks that interact
> > with the ssh keys, but the errors happen at random times, not when the
> > scheduled tasks run... :-S
> >
> > I'm running out of options here.
> >
> > Cheers,
> >
> > Miguel
> >
> > On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote:
> >
> > > -----BEGIN PGP SIGNED MESSAGE-----
> >
> > > Hash: SHA1
> >
> > >
> >
> > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote:
> >
> > >
> >
> > >> Many thanks for your answer. We've made sure that the
> >
> > >> keys are okay, as well as disabling host key checking to
> >
> > >> test it.
> >
> > >
> >
> > > Can you try and scp as that user to see whether it
> >
> > > complains about anything else?
> >
> > >
> >
> > > It may be that it is prompting the user to accept a
> >
> > > host key if they don't already have it.
> >
> > >
> >
> > > cheers,
> >
> > > Chris
> >
> > > - --
> >
> > >    Christopher Samuel - Senior Systems Administrator
> >
> > > VLSCI - Victorian Life Sciences Computation Initiative
> >
> > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> >
> > >         http://www.vlsci.unimelb.edu.au/
> >
> > >
> >
> > > -----BEGIN PGP SIGNATURE-----
> >
> > > Version: GnuPG v1.4.11 (GNU/Linux)
> >
> > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> >
> > >
> >
> > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW
> >
> > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS
> >
> > > =VPqK
> >
> > > -----END PGP SIGNATURE-----
> >
> > > _______________________________________________
> >
> > > torqueusers mailing list
> >
> > > torqueusers at supercluster.org
> >
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > --
> >
> > Miguel Gila
> >
> > CSCS Swiss National Supercomputing Centre
> >
> > HPC Solutions
> >
> > Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
> >
> > miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22
> >
> > -------------- next part --------------
> >
> > A non-text attachment was scrubbed...
> >
> > Name: smime.p7s
> >
> > Type: application/pkcs7-signature
> >
> > Size: 3239 bytes
> >
> > Desc: not available
> >
> > Url :
> >
> http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/214ea9d6/attachment-0001.bin
> >
> > ------------------------------
> >
> > Message: 3
> >
> > Date: Thu, 17 Nov 2011 10:14:32 -0000
> >
> > From: "RB. Ezhilalan (Principal Physicist, CUH)"
> > <RB.Ezhilalan at hse.ie>
> >
> > Subject: [torqueusers] Parallel processing for MC code
> >
> > To: torqueusers at supercluster.org
> >
> > Message-ID:
> >
> > <DB0960F9D7310D4BA87B4985061511A703B30D96 at CKVEX004.south.health.local>
> >
> > Content-Type: text/plain; charset="us-ascii"
> >
> > Hi All,
> >
> > I've been trying to set up the Torque queuing system on two SUSE 10.1
> > Linux PCs (PIII!).
> >
> > I installed Linux on both PCs, exported the home directory containing
> > the BEAMnrc Monte Carlo code from PC1 to PC2 via NFS and set up
> > passwordless SSH communication. All seems to be working fine.
> >
> > I downloaded the latest version of Torque (number not handy) and
> > installed pbs_server, pbs_mom & pbs_sched on PC1 and pbs_mom on PC2.
> >
> > The PBS 'nodes' file was created as per the guidelines; the server and
> > queue attributes were left at their defaults.
> >
> > The 'pbsnodes -a' command displays two nodes (PC1 & PC2) and they are
> > free. I am not sure whether this confirms that PBS/Torque is set up
> > correctly.
> >
> > I was able to run an executable BEAMnrc user code in batch mode, i.e.
> > using the 'exb' command, which is aliased to 'qsub' and sources a
> > built-in job script file, with option p=1 (single job).
> >
> > To split the job in two so that it runs in parallel on the two PCs,
> > option p=2 should be issued. However, what I noticed was that the job
> > ran twice on the first PC (PC1), not once on each.
> >
> > I can't figure out what went wrong; I suspect the PBS setup could have
> > some issues. Maybe I can try running the job specifically on PC2; if
> > so, what command do I need to give?
> >
> > I would be grateful for any advice!
> >
> > Kind Regards,
> >
> > Ezhilalan
> >
> > -------------- next part --------------
> >
> > An HTML attachment was scrubbed...
> >
> > URL:
> >
> http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/06e4a798/attachment-0001.html
> >
> >
> > ------------------------------
> >
> > Message: 4
> >
> > Date: Thu, 17 Nov 2011 10:18:18 -0600
> >
> > From: Jason Bacon <jwbacon at tds.net>
> >
> > Subject: Re: [torqueusers] Parallel processing for MC code
> >
> > To: Torque Users Mailing List <torqueusers at supercluster.org>
> >
> > Message-ID: <4EC533CA.2000902 at tds.net>
> >
> > Content-Type: text/plain; charset=windows-1252; format=flowed
> >
> > How many cores does PC1 have? Note that Torque schedules cores, not
> >
> > computers, unless you specifically tell it to with resource
> requirements.
> >
> > Regards,
> >
> > -J
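
In case it is useful, a minimal sketch of a job script that explicitly
asks Torque for one core on each of two machines (the BEAMnrc 'exb'
wrapper is not shown; this only illustrates the resource request):

    #!/bin/bash
    #PBS -l nodes=2:ppn=1
    #PBS -l walltime=01:00:00
    #PBS -j oe

    # List the hosts Torque allocated to this job, one line per core
    cat $PBS_NODEFILE

    # Run one copy of a command once per allocated slot
    pbsdsh hostname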
> >
> > On 11/17/11 04:14, RB. Ezhilalan (Principal Physicist, CUH) wrote:
> >
> > >
> >
> > > Hi All,
> >
> > >
> >
> > > I've been trying to set up Torque queuing system on two SUSE10.1
> linux
> >
> > > PCs (PIII!).
> >
> > >
> >
> > > Installed the linux on both PCs, exported home directory
> > > containing
> >
> > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH
> >
> > > password less communication. All seems to be working fine.
> >
> > >
> >
> > > Downloaded latest version of Torque (number not handy) installed
> >
> > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2.
> >
> > >
> >
> > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and
> > > QUEUE
> >
> > > attributes were set as default.
> >
> > >
> >
> > > pbsnodes -a command displays two nodes (PC1 & PC2) and they are
> > > free.
> >
> > > I am not sure whether this confirms PBS/Torque set up correctly.
> >
> > >
> >
> > > I was able to run an executable BEAMnrc user code in batch mode
> > > i.e
> >
> > > using 'exb' command aliased to 'qsub' and sources a built in job
> >
> > > script file with option p=1 (single job).
> >
> > >
> >
> > > To split the jobs in to two, so that it runs in parallel on the
> > > two
> >
> > > PCs, option p=2 should be issued. However, what I noticed was,
> > > the
> job
> >
> > > ran twice on the first PC (PC1) but not on both.
> >
> > >
> >
> > > I can't figure out what went wrong, I suspect PBS setup could
> > > have
> >
> > > some issues, May be I can try running the job specifically on PC2
> > > if
> >
> > > so what command I need to give?
> >
> > >
> >
> > > I would be grateful for any advice!
> >
> > >
> >
> > > Kind Regards,
> >
> > >
> >
> > > Ezhilalan
> >
> > >
> >
> > >
> >
> > > _______________________________________________
> >
> > > torqueusers mailing list
> >
> > > torqueusers at supercluster.org
> >
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > --
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > Jason W. Bacon
> >
> > jwbacon at tds.net
> >
> > http://personalpages.tds.net/~jwbacon
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > ------------------------------
> >
> > Message: 5
> >
> > Date: Thu, 17 Nov 2011 18:19:14 +0100
> >
> > From: Steve Traylen <steve.traylen at cern.ch>
> >
> > Subject: Re: [torqueusers] File staging syntax
> >
> > To: Torque Users Mailing List <torqueusers at supercluster.org>
> >
> > Message-ID:
> >
> > <CAOXEVSCY2CC-=ajKvcc6PAgKd5S6fupRkgNp79KL_w3=k2Xy1A at mail.gmail.com>
> >
> > Keywords: CERN SpamKiller Note: -50
> >
> > Content-Type: text/plain; charset="ISO-8859-1"
> >
> > On Thu, Sep 29, 2011 at 4:59 PM, Ken Nielson
> >
> > <knielson at adaptivecomputing.com> wrote:
> >
> > > André,
> >
> > >
> >
> > > I have not yet had time to reproduce this. I did look through the
> > change log and there are two suspects. One is in 2.5.6, a fix for
> > Bugzilla 115 and the other is in 2.5.8, a fix for Bugzilla 133.
> >
> > >
> >
> > > That is as far as I am right now. I will try to get to this as
> > > soon
> > as I can.
> >
> > Hi Ken,
> >
> >  Did you manage to track this down? It's currently making upgrading
> >  a pain.
> >
> > Steve.
> >
> > --
> >
> > Steve Traylen
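
For reference, the staging syntax being discussed follows the documented
qsub form; the host and path names below are only illustrative:

    # Copy data.in from the submit host before the job starts, and copy
    # data.out back to it after the job ends
    qsub -W stagein=data.in@submithost:/home/user/data.in \
         -W stageout=data.out@submithost:/home/user/results/data.out job.pbs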
> >
> > ------------------------------
> >
> > _______________________________________________
> >
> > torqueusers mailing list
> >
> > torqueusers at supercluster.org
> >
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > End of torqueusers Digest, Vol 88, Issue 14
> >
> > *******************************************
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Jason W. Bacon
> jwbacon at tds.net
> http://personalpages.tds.net/~jwbacon
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Fri, 18 Nov 2011 09:33:12 -0500
> From: Lance Westerhoff <lance at quantumbioinc.com>
> Subject: Re: [torqueusers] procs= not working as documented
> To: torqueusers at supercluster.org
> Message-ID: <CCDE8276-0991-41D7-A719-239F7CB1C666 at quantumbioinc.com>
> Content-Type: text/plain; charset=us-ascii
> 
> The request that is placed is for procs=60. Both torque and maui see
> that there are only 53 processors available and instead of letting
> the
> job sit in the queue and wait for all 60 processors to become
> available,
> it goes ahead and runs the job with what's available. Now if the user
> could ask for procs=[50-60] where 50 is the minimum number of
> processors
> to provide and 60 is the maximum, this would be a feature. But as it
> stands, if the user asks for 60 processors and ends up with 2
> processors, the job just won't scale properly and he may as well kill
> it
> (when it shouldn't have run anyway).
> 
> I'm actually beginning to think the problem may be related to maui.
> Perhaps I'll post this same question to the maui list and see what
> comes
> back.
> 
> This problem is infuriating though since without the functionality
> working as it should, using procs=X in torque/maui makes torque/maui
> work more like a submission and run system (not a queuing system).
> 
> -Lance
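
Until procs= behaves as documented, one workaround is to request an
explicit geometry instead, since a nodes=/ppn= request should not start
until the full allocation is free. A sketch only; the 15x4 shape assumes
4-core nodes and is not taken from the original post:

    #PBS -l nodes=15:ppn=4
    #PBS -l pmem=700mb
    #PBS -l walltime=744:00:00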
> 
> 
> > 
> > Message: 3
> > Date: Thu, 17 Nov 2011 17:29:17 -0800
> > From: "Brock Palen" <brockp at umich.edu>
> > Subject: Re: [torqueusers] procs= not working as documented
> > To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> > Message-ID:
> > <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
> > Content-Type: text/plain; charset="utf-8"
> > 
> > Does maui only see one cpu or does mpiexec only see one cpu?
> > 
> > 
> > 
> > Brock Palen
> > (734)936-1985
> > brockp at umich.edu
> > - Sent from my Palm Pre, please excuse typos
> > On Nov 17, 2011 3:19 PM, Lance Westerhoff
> &lt;lance at quantumbioinc.com&gt; wrote:
> > 
> > 
> > 
> > Hello All-
> > 
> > 
> > 
> > It appears that when running with the following specs, the procs=
> option does not actually work as expected.
> > 
> > 
> > 
> > ==========================================
> > 
> > 
> > 
> > #PBS -S /bin/bash
> > 
> > #PBS -l procs=60
> > 
> > #PBS -l pmem=700mb
> > 
> > #PBS -l walltime=744:00:00
> > 
> > #PBS -j oe
> > 
> > #PBS -q batch
> > 
> > 
> > 
> > torque version: tried 3.0.2. In v2.5.4, I think the procs option
> > worked as documented.
> > 
> > maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete
> fail in terms of the procs option and it only asks for a single CPU)
> > 
> > 
> > 
> > ==========================================
> > 
> > 
> > 
> > If there are fewer than 60 processors available in the cluster (in
> > this case there were 53 available), the job will go in and take
> > whatever is left instead of waiting for all 60 processors to free up.
> > Any thoughts as to why this might be happening? Sometimes it doesn't
> > really matter and 53 would be almost as good as 60; however, if only 2
> > processors are available and the user asks for 60, I would hate for
> > him to go in.
> > 
> > 
> > 
> > Thank you for your time!
> > 
> > 
> > 
> > -Lance
> > 
> > 
> > 
> > 
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Fri, 18 Nov 2011 09:47:24 -0500
> From: Steve Crusan <scrusan at ur.rochester.edu>
> Subject: Re: [torqueusers] procs= not working as documented
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <B2DF69B9-AEB2-4972-8936-EE2F528D07D5 at ur.rochester.edu>
> Content-Type: text/plain; charset=us-ascii
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
> 
> > The request that is placed is for procs=60. Both torque and maui
> > see
> that there are only 53 processors available and instead of letting
> the
> job sit in the queue and wait for all 60 processors to become
> available,
> it goes ahead and runs the job with what's available. Now if the user
> could ask for procs=[50-60] where 50 is the minimum number of
> processors
> to provide and 60 is the maximum, this would be a feature. But as it
> stands, if the user asks for 60 processors and ends up with 2
> processors, the job just won't scale properly and he may as well kill
> it
> (when it shouldn't have run anyway).
> 
> Hi Lance,
> 
> 	Can you post the output of checkjob <jobid> for an incorrectly
> running job? Let's take a look at what Maui thinks the job is asking
> for.
> 	
> 	Might as well add your maui.cfg file also.
> 
> 	I've found in the past that procs= is troublesome...
> 
> > 
> > I'm actually beginning to think the problem may be related to maui.
> Perhaps I'll post this same question to the maui list and see what
> comes
> back.
> > 
> > This problem is infuriating though since without the functionality
> working as it should, using procs=X in torque/maui makes torque/maui
> work more like a submission and run system (not a queuing system).
> 
> Agreed. HPC cluster job management is normally set it and forget it.
> Anything other than maintenance/break fixes/new features would be
> ridiculously time consuming.
> 
> > 
> > -Lance
> > 
> > 
> >> 
> >> Message: 3
> >> Date: Thu, 17 Nov 2011 17:29:17 -0800
> >> From: "Brock Palen" <brockp at umich.edu>
> >> Subject: Re: [torqueusers] procs= not working as documented
> >> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >> Message-ID:
> >> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
> >> Content-Type: text/plain; charset="utf-8"
> >> 
> >> Does maui only see one cpu or does mpiexec only see one cpu?
> >> 
> >> 
> >> 
> >> Brock Palen
> >> (734)936-1985
> >> brockp at umich.edu
> >> - Sent from my Palm Pre, please excuse typos
> >> On Nov 17, 2011 3:19 PM, Lance Westerhoff
> &lt;lance at quantumbioinc.com&gt; wrote:
> >> 
> >> 
> >> 
> >> Hello All-
> >> 
> >> 
> >> 
> >> It appears that when running with the following specs, the procs=
> option does not actually work as expected.
> >> 
> >> 
> >> 
> >> ==========================================
> >> 
> >> 
> >> 
> >> #PBS -S /bin/bash
> >> 
> >> #PBS -l procs=60
> >> 
> >> #PBS -l pmem=700mb
> >> 
> >> #PBS -l walltime=744:00:00
> >> 
> >> #PBS -j oe
> >> 
> >> #PBS -q batch
> >> 
> >> 
> >> 
> >> torque version: tried 3.0.2. In v2.5.4, I think the procs option
> >> worked as documented.
> >> 
> >> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete
> fail in terms of the procs option and it only asks for a single CPU)
> >> 
> >> 
> >> 
> >> ==========================================
> >> 
> >> 
> >> 
> >> If there are fewer than 60 processors available in the cluster (in
> >> this case there were 53 available), the job will go in and take
> >> whatever is left instead of waiting for all 60 processors to free up.
> >> Any thoughts as to why this might be happening? Sometimes it doesn't
> >> really matter and 53 would be almost as good as 60; however, if only 2
> >> processors are available and the user asks for 60, I would hate for
> >> him to go in.
> >> 
> >> 
> >> 
> >> Thank you for your time!
> >> 
> >> 
> >> 
> >> -Lance
> >> 
> >> 
> >> 
> >> 
> > 
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 
>  ----------------------
>  Steve Crusan
>  System Administrator
>  Center for Research Computing
>  University of Rochester
>  https://www.crc.rochester.edu/
> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> Comment: GPGTools - http://gpgtools.org
> 
> iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/
> bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28
> cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ
> tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8
> JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv
> Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c=
> =AGW7
> -----END PGP SIGNATURE-----
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Fri, 18 Nov 2011 11:12:06 -0500
> From: Lance Westerhoff <lance at quantumbioinc.com>
> Subject: Re: [torqueusers] procs= not working as documented
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE at quantumbioinc.com>
> Content-Type: text/plain; charset=us-ascii
> 
> 
> Hi Steve-
> 
> Here you go. Here are the first few lines of the job script. I have
> then provided the output you requested along with the maui.cfg. If you
> need anything further, please let me know.
> 
> Thanks for your help!
> 
> ===============
> 
>  + head job.pbs
> 
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -l procs=100
> #PBS -l pmem=700mb
> #PBS -l walltime=744:00:00
> #PBS -j oe
> #PBS -q batch
> 
> Report run on Fri Nov 18 10:49:38 EST 2011
> + pbsnodes --version
> version: 3.0.2
> + diagnose --version
> maui client version 3.2.6p21
> + checkjob 371010
> 
> 
> checking job 371010
> 
> State: Running
> Creds:  user:josh  group:games  class:batch  qos:DEFAULT
> WallTime: 00:02:35 of 31:00:00:00
> SubmitTime: Fri Nov 18 10:46:33
>   (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
> 
> StartTime: Fri Nov 18 10:46:34
> Total Tasks: 1
> 
> Req[0]  TaskCount: 26  Partition: DEFAULT
> Network: [NONE]  Memory >= 700M  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 700M
> NodeCount: 10
> Allocated Nodes:
> [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
> [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
> [compute-0-13:2][compute-0-14:2]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> Reservation '371010' (-00:02:09 -> 30:23:57:51  Duration:
> 31:00:00:00)
> PE:  26.00  StartPriority:  4716
> 
> + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"
> SERVERHOST            gondor
> ADMIN1                maui root
> ADMIN3                ALL
> RMCFG[base]  TYPE=PBS
> AMCFG[bank]  TYPE=NONE
> RMPOLLINTERVAL        00:01:00
> SERVERPORT            42559
> SERVERMODE            NORMAL
> LOGFILE               maui.log
> LOGFILEMAXSIZE        10000000
> LOGLEVEL              3
> QUEUETIMEWEIGHT       1
> FSPOLICY              DEDICATEDPS
> FSDEPTH               7
> FSINTERVAL            86400
> FSDECAY               0.50
> FSWEIGHT              200
> FSUSERWEIGHT          1
> FSGROUPWEIGHT         1000
> FSQOSWEIGHT           1000
> FSACCOUNTWEIGHT       1
> FSCLASSWEIGHT         1000
> USERWEIGHT            4
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
> NODEALLOCATIONPOLICY  MINRESOURCE
> RESERVATIONDEPTH            8
> MAXJOBPERUSERPOLICY         OFF
> MAXJOBPERUSERCOUNT          8
> MAXPROCPERUSERPOLICY        OFF
> MAXPROCPERUSERCOUNT         256
> MAXPROCSECONDPERUSERPOLICY  OFF
> MAXPROCSECONDPERUSERCOUNT   36864000
> MAXJOBQUEUEDPERUSERPOLICY   OFF
> MAXJOBQUEUEDPERUSERCOUNT    2
> JOBNODEMATCHPOLICY          EXACTNODE
> NODEACCESSPOLICY            SHARED
> JOBMAXOVERRUN 99:00:00:00
> DEFERCOUNT 8192
> DEFERTIME  0
> CLASSCFG[developer] FSTARGET=40.00+
> CLASSCFG[lowprio] PRIORITY=-1000
> SRCFG[developer] CLASSLIST=developer
> SRCFG[developer] ACCESS=dedicated
> SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri
> SRCFG[developer] STARTTIME=08:00:00
> SRCFG[developer] ENDTIME=18:00:00
> SRCFG[developer] TIMELIMIT=2:00:00
> SRCFG[developer] RESOURCES=PROCS(8)
> USERCFG[DEFAULT]      FSTARGET=100.0
> 
> ===============
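
Reading the output above: the script asked for procs=100, but Maui's
Req[0] shows TaskCount: 26 spread over 10 nodes, so the request had
already been cut down to whatever was free when the job started. A couple
of commands that may help narrow down where the translation goes wrong (a
sketch, not a confirmed fix):

    # What did pbs_server record for the job's resource list?
    qstat -f 371010 | grep -i resource_list

    # What does Maui think each node offers, and how many processors are
    # free right now?
    diagnose -n
    showbf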
> 
> -Lance
> 
> 
> On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote:
> 
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> > 
> > 
> > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
> > 
> >> The request that is placed is for procs=60. Both torque and maui
> >> see
> that there are only 53 processors available and instead of letting
> the
> job sit in the queue and wait for all 60 processors to become
> available,
> it goes ahead and runs the job with what's available. Now if the user
> could ask for procs=[50-60] where 50 is the minimum number of
> processors
> to provide and 60 is the maximum, this would be a feature. But as it
> stands, if the user asks for 60 processors and ends up with 2
> processors, the job just won't scale properly and he may as well kill
> it
> (when it shouldn't have run anyway).
> > 
> > Hi Lance,
> > 
> > 	Can you post the output of checkjob <jobid> for an incorrectly
> > running job? Let's take a look at what Maui thinks the job is asking
> > for.
> > 	
> > 	Might as well add your maui.cfg file also.
> > 
> > 	I've found in the past that procs= is troublesome...
> > 
> >> 
> >> I'm actually beginning to think the problem may be related to
> >> maui.
> Perhaps I'll post this same question to the maui list and see what
> comes
> back.
> >> 
> >> This problem is infuriating though since without the functionality
> working as it should, using procs=X in torque/maui makes torque/maui
> work more like a submission and run system (not a queuing system).
> > 
> > Agreed. HPC cluster job management is normally set it and forget it.
> > Anything other than maintenance/break fixes/new features would be
> > ridiculously time consuming.
> > 
> >> 
> >> -Lance
> >> 
> >> 
> >>> 
> >>> Message: 3
> >>> Date: Thu, 17 Nov 2011 17:29:17 -0800
> >>> From: "Brock Palen" <brockp at umich.edu>
> >>> Subject: Re: [torqueusers] procs= not working as documented
> >>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> >>> Message-ID:
> >>> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
> >>> Content-Type: text/plain; charset="utf-8"
> >>> 
> >>> Does maui only see one cpu or does mpiexec only see one cpu?
> >>> 
> >>> 
> >>> 
> >>> Brock Palen
> >>> (734)936-1985
> >>> brockp at umich.edu
> >>> - Sent from my Palm Pre, please excuse typos
> >>> On Nov 17, 2011 3:19 PM, Lance Westerhoff
> &lt;lance at quantumbioinc.com&gt; wrote:
> >>> 
> >>> 
> >>> 
> >>> Hello All-
> >>> 
> >>> 
> >>> 
> >>> It appears that when running with the following specs, the procs=
> option does not actually work as expected.
> >>> 
> >>> 
> >>> 
> >>> ==========================================
> >>> 
> >>> 
> >>> 
> >>> #PBS -S /bin/bash
> >>> 
> >>> #PBS -l procs=60
> >>> 
> >>> #PBS -l pmem=700mb
> >>> 
> >>> #PBS -l walltime=744:00:00
> >>> 
> >>> #PBS -j oe
> >>> 
> >>> #PBS -q batch
> >>> 
> >>> 
> >>> 
> >>> torque version: tried 3.0.2. In v2.5.4, I think the procs option
> >>> worked as documented.
> >>> 
> >>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a
> >>> complete
> fail in terms of the procs option and it only asks for a single CPU)
> >>> 
> >>> 
> >>> 
> >>> ==========================================
> >>> 
> >>> 
> >>> 
> >>> If there are fewer than 60 processors available in the cluster (in
> >>> this case there were 53 available), the job will go in and take
> >>> whatever is left instead of waiting for all 60 processors to free
> >>> up. Any thoughts as to why this might be happening? Sometimes it
> >>> doesn't really matter and 53 would be almost as good as 60; however,
> >>> if only 2 processors are available and the user asks for 60, I would
> >>> hate for him to go in.
> >>> 
> >>> 
> >>> 
> >>> Thank you for your time!
> >>> 
> >>> 
> >>> 
> >>> -Lance
> >>> 
> >>> 
> >>> 
> >>> 
> >> 
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> > 
> > ----------------------
> > Steve Crusan
> > System Administrator
> > Center for Research Computing
> > University of Rochester
> > https://www.crc.rochester.edu/
> > 
> > 
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> > Comment: GPGTools - http://gpgtools.org
> > 
> > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/
> > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28
> > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ
> > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8
> > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv
> > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c=
> > =AGW7
> > -----END PGP SIGNATURE-----
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> End of torqueusers Digest, Vol 88, Issue 16
> *******************************************
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

