[torqueusers] torqueusers Digest, Vol 88, Issue 16

RB. Ezhilalan (Principal Physicist, CUH) RB.Ezhilalan at hse.ie
Fri Nov 18 09:44:30 MST 2011


Jason,

I had 'linux-01 np=1' and 'linux-02 np=1' in the nodes file; despite
this, the job ran on one core (linux-01) only. I have since removed the
'np' option from the entries, on the assumption that the system will
'autodetect' the cores.
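
For reference, the entries looked roughly like this ('np' being the
number of job slots Torque may schedule on that host):

linux-01 np=1
linux-02 np=1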

Ezhilalan

Ezhilalan Ramalingam M.Sc.,DABR.,
Principal Physicist (Radiotherapy),
Medical Physics Department,
Cork University Hospital,
Wilton, Cork
Ireland
Tel. 00353 21 4922533
Fax.00353 21 4921300
Email: rb.ezhilalan at hse.ie 

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of
torqueusers-request at supercluster.org
Sent: 18 November 2011 16:12
To: torqueusers at supercluster.org
Subject: torqueusers Digest, Vol 88, Issue 16

Send torqueusers mailing list submissions to
	torqueusers at supercluster.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://www.supercluster.org/mailman/listinfo/torqueusers
or, via email, send a message with subject or body 'help' to
	torqueusers-request at supercluster.org

You can reach the person managing the list at
	torqueusers-owner at supercluster.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of torqueusers digest..."


Today's Topics:

   1. Re: Parallel processing for MC code (Jason Bacon)
   2. Re: procs= not working as documented (Lance Westerhoff)
   3. Re: procs= not working as documented (Steve Crusan)
   4. Re: procs= not working as documented (Lance Westerhoff)


----------------------------------------------------------------------

Message: 1
Date: Fri, 18 Nov 2011 07:57:23 -0600
From: Jason Bacon <jwbacon at tds.net>
Subject: Re: [torqueusers] Parallel processing for MC code
To: Torque Users Mailing List <torqueusers at supercluster.org>
Message-ID: <4EC66443.3080608 at tds.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed


I was only wondering if you had "np=2" in the Linux-01 entry, or if 
Torque was configured to autodetect the number of cores and there were 
two.  That would have explained the scheduling behavior.
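
For example, either an explicit slot count in server_priv/nodes:

linux-01 np=2

or (if I remember the attribute name correctly) telling pbs_server to
pick up np from what each MOM reports:

qmgr -c 'set server auto_node_np = True'

would have made two slots schedulable on that host. And to steer a job
onto a particular host, something like 'qsub -l nodes=linux-02' should
work.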

Regards,

     -J

On 11/18/11 03:48, RB. Ezhilalan (Principal Physicist, CUH) wrote:
>
> Hi Jason,
>
> PC1 (linux-01) is a single-core PC like PC2. I defined the
> server_priv/nodes file as:
>
> linux-01
>
> linux-02
>
> As you mentioned, maybe the resource requirements need to be set up
> properly. Do you have any suggestions?
>
> Many thanks,
>
> Ezhilalan
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org 
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of 
> torqueusers-request at supercluster.org
> Sent: 17 November 2011 17:20
> To: torqueusers at supercluster.org
> Subject: torqueusers Digest, Vol 88, Issue 14
>
> Today's Topics:
>
>    1. Re: Random SCP errors when transfering to/from  CREAM sandbox
>
>       (Christopher Samuel)
>
>    2. Re: Random SCP errors when transfering to/from  CREAM sandbox
>
>       (Gila Arrondo  Miguel Angel)
>
>    3. Parallel processing for MC code
>
>       (RB. Ezhilalan (Principal Physicist, CUH))
>
>    4. Re: Parallel processing for MC code (Jason Bacon)
>
>    5. Re: File staging syntax (Steve Traylen)
>
> ----------------------------------------------------------------------
>
> Message: 1
>
> Date: Thu, 17 Nov 2011 13:29:44 +1100
>
> From: Christopher Samuel <samuel at unimelb.edu.au>
>
> Subject: Re: [torqueusers] Random SCP errors when transfering to/from
>
>       CREAM sandbox
>
> To: torqueusers at supercluster.org
>
> Message-ID: <4EC47198.1040709 at unimelb.edu.au>
>
> Content-Type: text/plain; charset=ISO-8859-1
>
> -----BEGIN PGP SIGNED MESSAGE-----
>
> Hash: SHA1
>
> On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote:
>
> > Many thanks for your answer. We've made sure that the
>
> > keys are okay, as well as disabling host key checking to
>
> > test it.
>
> Can you try and scp as that user to see whether it
>
> complains about anything else?
>
> It may be that it is prompting the user to accept a
>
> host key if they don't already have it.
>
> cheers,
>
> Chris
>
> - -- 
>
>     Christopher Samuel - Senior Systems Administrator
>
>  VLSCI - Victorian Life Sciences Computation Initiative
>
>  Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
>
>          http://www.vlsci.unimelb.edu.au/
>
> -----BEGIN PGP SIGNATURE-----
>
> Version: GnuPG v1.4.11 (GNU/Linux)
>
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW
>
> sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS
>
> =VPqK
>
> -----END PGP SIGNATURE-----
>
> ------------------------------
>
> Message: 2
>
> Date: Thu, 17 Nov 2011 07:55:50 +0000
>
> From: "Gila Arrondo  Miguel Angel" <miguel.gila at cscs.ch>
>
> Subject: Re: [torqueusers] Random SCP errors when transfering to/from
>
>       CREAM sandbox
>
> To: Torque Users Mailing List <torqueusers at supercluster.org>
>
> Message-ID: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE at cscs.ch>
>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi Chris,
>
> I've done that on many WNs and with different users, so I don't think
> that is the issue. I've also checked for scheduled tasks that
> interact with the ssh keys, but the errors happen at random times, not
> when the scheduled tasks run... :-S
>
> I'm running out of options here.
>
> Cheers,
>
> Miguel
>
> On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote:
>
> > -----BEGIN PGP SIGNED MESSAGE-----
>
> > Hash: SHA1
>
> >
>
> > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote:
>
> >
>
> >> Many thanks for your answer. We've made sure that the
>
> >> keys are okay, as well as disabling host key checking to
>
> >> test it.
>
> >
>
> > Can you try and scp as that user to see whether it
>
> > complains about anything else?
>
> >
>
> > It may be that it is prompting the user to accept a
>
> > host key if they don't already have it.
>
> >
>
> > cheers,
>
> > Chris
>
> > - --
>
> >    Christopher Samuel - Senior Systems Administrator
>
> > VLSCI - Victorian Life Sciences Computation Initiative
>
> > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
>
> >         http://www.vlsci.unimelb.edu.au/
>
> >
>
>
>
> --
>
> Miguel Gila
>
> CSCS Swiss National Supercomputing Centre
>
> HPC Solutions
>
> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>
> miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22
>
>
> ------------------------------
>
> Message: 3
>
> Date: Thu, 17 Nov 2011 10:14:32 -0000
>
> From: "RB. Ezhilalan (Principal Physicist, CUH)" <RB.Ezhilalan at hse.ie>
>
> Subject: [torqueusers] Parallel processing for MC code
>
> To: torqueusers at supercluster.org
>
> Message-ID:
>
> <DB0960F9D7310D4BA87B4985061511A703B30D96 at CKVEX004.south.health.local>
>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi All,
>
> I've been trying to set up the Torque queuing system on two SUSE 10.1
> Linux PCs (PIII!).
>
> I installed Linux on both PCs, exported the home directory containing
> the BEAMnrc Monte Carlo code from PC1 to PC2 via NFS, and set up
> passwordless SSH communication. All seems to be working fine.
>
> I downloaded the latest version of Torque (number not handy) and
> installed PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2.
>
> The PBS 'nodes' file was created as per the guidelines; the PBS_SERVER
> and QUEUE attributes were left at the defaults.
>
> The 'pbsnodes -a' command displays two nodes (PC1 & PC2), and they are
> free. I am not sure whether this confirms that PBS/Torque is set up
> correctly.
>
> I was able to run an executable BEAMnrc user code in batch mode, i.e.
> using the 'exb' command, which is aliased to 'qsub' and sources a
> built-in job script file, with option p=1 (single job).
>
> To split the job in two, so that it runs in parallel on the two PCs,
> option p=2 should be issued. However, what I noticed was that the job
> ran twice on the first PC (PC1), not on both.
>
> I can't figure out what went wrong; I suspect the PBS setup could have
> some issues. Maybe I can try running the job specifically on PC2; if
> so, what command do I need to give?
>
> I would be grateful for any advice!
>
> Kind Regards,
>
> Ezhilalan
>
>
>
> ------------------------------
>
> Message: 4
>
> Date: Thu, 17 Nov 2011 10:18:18 -0600
>
> From: Jason Bacon <jwbacon at tds.net>
>
> Subject: Re: [torqueusers] Parallel processing for MC code
>
> To: Torque Users Mailing List <torqueusers at supercluster.org>
>
> Message-ID: <4EC533CA.2000902 at tds.net>
>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> How many cores does PC1 have? Note that Torque schedules cores, not
> computers, unless you specifically tell it to with resource
> requirements.
>
> Regards,
>
> -J
>
> On 11/17/11 04:14, RB. Ezhilalan (Principal Physicist, CUH) wrote:
>
> >
>
> > Hi All,
>
> >
>
> > I?ve been trying to set up Torque queuing system on two SUSE10.1
linux
>
> > PCs (PIII!).
>
> >
>
> > Installed the linux on both PCs, exported home directory containing
>
> > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH
>
> > password less communication. All seems to be working fine.
>
> >
>
> > Downloaded latest version of Torque (number not handy) installed
>
> > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2.
>
> >
>
> > PBS ?nodes? file was created as per guidelines, PBS_SERVER and QUEUE
>
> > attributes were set as default.
>
> >
>
> > Pbsnodes ?a command displays- two nodes (PC1 & PC2 and they are
free.
>
> > I am not sure whether this confirms PBS/Torque set up correctly.
>
> >
>
> > I was able to run an executable BEAMnrc user code in batch mode i.e
>
> > using ?exb? command aliased to ?qsub? and sources a built in job
>
> > script file with option p=1 (single job).
>
> >
>
> > To split the jobs in to two, so that it runs in parallel on the two
>
> > PCs, option p=2 should be issued. However, what I noticed was, the
job
>
> > ran twice on the first PC (PC1) but not on both.
>
> >
>
> > I can?t figure out what went wrong, I suspect PBS setup could have
>
> > some issues, May be I can try running the job specifically on PC2 if
>
> > so what command I need to give?
>
> >
>
> > I would be grateful for any advice!
>
> >
>
> > Kind Regards,
>
> >
>
> > Ezhilalan
>
> >
>
> >
>
>
> -- 
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Jason W. Bacon
>
> jwbacon at tds.net
>
> http://personalpages.tds.net/~jwbacon
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> ------------------------------
>
> Message: 5
>
> Date: Thu, 17 Nov 2011 18:19:14 +0100
>
> From: Steve Traylen <steve.traylen at cern.ch>
>
> Subject: Re: [torqueusers] File staging syntax
>
> To: Torque Users Mailing List <torqueusers at supercluster.org>
>
> Message-ID:
>
> <CAOXEVSCY2CC-=ajKvcc6PAgKd5S6fupRkgNp79KL_w3=k2Xy1A at mail.gmail.com>
>
> Keywords: CERN SpamKiller Note: -50
>
> Content-Type: text/plain; charset="ISO-8859-1"
>
> On Thu, Sep 29, 2011 at 4:59 PM, Ken Nielson
> <knielson at adaptivecomputing.com> wrote:
>
> > André,
> >
> > I have not yet had time to reproduce this. I did look through the
> > change log and there are two suspects. One is in 2.5.6, a fix for
> > Bugzilla 115, and the other is in 2.5.8, a fix for Bugzilla 133.
> >
> > That is as far as I am right now. I will try to get to this as soon
> > as I can.
>
> Hi Ken,
>
> Did you manage to track this down? It's currently making upgrading a
> pain.
>
> Steve.
>
> -- 
>
> Steve Traylen
>
> ------------------------------
>
>
> End of torqueusers Digest, Vol 88, Issue 14
>
> *******************************************
>
>


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




------------------------------

Message: 2
Date: Fri, 18 Nov 2011 09:33:12 -0500
From: Lance Westerhoff <lance at quantumbioinc.com>
Subject: Re: [torqueusers] procs= not working as documented
To: torqueusers at supercluster.org
Message-ID: <CCDE8276-0991-41D7-A719-239F7CB1C666 at quantumbioinc.com>
Content-Type: text/plain; charset=us-ascii

The request that is placed is for procs=60. Both Torque and Maui see
that there are only 53 processors available, and instead of letting the
job sit in the queue and wait for all 60 processors to become available,
they go ahead and run the job with what's available. Now, if the user
could ask for procs=[50-60], where 50 is the minimum number of
processors to provide and 60 is the maximum, that would be a feature.
But as it stands, if the user asks for 60 processors and ends up with 2
processors, the job just won't scale properly and he may as well kill it
(it shouldn't have run anyway).
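
(Purely to illustrate what I mean, the request would look something like
the line below; as far as I know this range syntax is hypothetical and
not accepted by any current Torque/Maui release:

#PBS -l procs=50-60

i.e. start the job with anywhere from 50 to 60 slots, otherwise leave it
queued.)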

I'm actually beginning to think the problem may be related to maui.
Perhaps I'll post this same question to the maui list and see what comes
back. 

This problem is infuriating, though, since without this functionality
working as it should, procs=X makes Torque/Maui behave more like a
submit-and-run system than a queuing system.

-Lance


> 
> Message: 3
> Date: Thu, 17 Nov 2011 17:29:17 -0800
> From: "Brock Palen" <brockp at umich.edu>
> Subject: Re: [torqueusers] procs= not working as documented
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Does maui only see one cpu or does mpiexec only see one cpu?
> 
> 
> 
> Brock Palen
> (734)936-1985
> brockp at umich.edu
> - Sent from my Palm Pre, please excuse typos
> On Nov 17, 2011 3:19 PM, Lance Westerhoff
> <lance at quantumbioinc.com> wrote:
> 
> 
> 
> Hello All-
> 
> 
> 
> It appears that when running with the following specs, the procs=
> option does not actually work as expected.
> 
> 
> 
> ==========================================
> 
> 
> 
> #PBS -S /bin/bash
> 
> #PBS -l procs=60
> 
> #PBS -l pmem=700mb
> 
> #PBS -l walltime=744:00:00
> 
> #PBS -j oe
> 
> #PBS -q batch
> 
> 
> 
> torque version: tried 3.0.2. In v2.5.4, I think the procs option
> worked as documented.
> 
> maui version: 3.2.6p21 (also tried maui 3.3.1, but it is a complete
> fail in terms of the procs option and it only asks for a single CPU)
> 
> 
> 
> ==========================================
> 
> 
> 
> If there are fewer than 60 processors available in the cluster (in
> this case there were 53 available), the job will go in and take
> whatever is left instead of waiting for all 60 processors to free up.
> Any thoughts as to why this might be happening? Sometimes it doesn't
> really matter, and 53 would be almost as good as 60; however, if only
> 2 processors are available and the user asks for 60, I would hate for
> his job to go in.
> 
> 
> 
> Thank you for your time!
> 
> 
> 
> -Lance
> 
> 
> 
> 



------------------------------

Message: 3
Date: Fri, 18 Nov 2011 09:47:24 -0500
From: Steve Crusan <scrusan at ur.rochester.edu>
Subject: Re: [torqueusers] procs= not working as documented
To: Torque Users Mailing List <torqueusers at supercluster.org>
Message-ID: <B2DF69B9-AEB2-4972-8936-EE2F528D07D5 at ur.rochester.edu>
Content-Type: text/plain; charset=us-ascii

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:

> The request that is placed is for procs=60. Both Torque and Maui see
> that there are only 53 processors available, and instead of letting
> the job sit in the queue and wait for all 60 processors to become
> available, they go ahead and run the job with what's available. Now,
> if the user could ask for procs=[50-60], where 50 is the minimum
> number of processors to provide and 60 is the maximum, that would be a
> feature. But as it stands, if the user asks for 60 processors and ends
> up with 2 processors, the job just won't scale properly and he may as
> well kill it (it shouldn't have run anyway).

Hi Lance,

	Can you post the output of 'checkjob <jobid>' for an incorrectly
running job? Let's take a look at what Maui thinks the job is asking
for.
	
	Might as well add your maui.cfg file also. 

	I've found in the past that procs= is troublesome...
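
	For example, something along these lines (job id and config path
adjusted to your setup; /opt/maui/maui.cfg is just a common location):

checkjob <jobid>
qstat -f <jobid>
grep '^[A-Z]' /opt/maui/maui.cfg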

> 
> I'm actually beginning to think the problem may be related to maui.
> Perhaps I'll post this same question to the maui list and see what
> comes back.
> 
> This problem is infuriating, though, since without this functionality
> working as it should, procs=X makes Torque/Maui behave more like a
> submit-and-run system than a queuing system.

Agreed. HPC cluster job management should normally be
set-it-and-forget-it. Anything other than maintenance, break fixes, and
new features would be ridiculously time consuming.

> 
> -Lance
> 
> 
>> 
>> Message: 3
>> Date: Thu, 17 Nov 2011 17:29:17 -0800
>> From: "Brock Palen" <brockp at umich.edu>
>> Subject: Re: [torqueusers] procs= not working as documented
>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
>> Content-Type: text/plain; charset="utf-8"
>> 
>> Does maui only see one cpu or does mpiexec only see one cpu?
>> 
>> 
>> 
>> Brock Palen
>> (734)936-1985
>> brockp at umich.edu
>> - Sent from my Palm Pre, please excuse typos
>> On Nov 17, 2011 3:19 PM, Lance Westerhoff
>> <lance at quantumbioinc.com> wrote:
>> 
>> 
>> 
>> Hello All-
>> 
>> 
>> 
>> It appears that when running with the following specs, the procs=
>> option does not actually work as expected.
>> 
>> 
>> 
>> ==========================================
>> 
>> 
>> 
>> #PBS -S /bin/bash
>> 
>> #PBS -l procs=60
>> 
>> #PBS -l pmem=700mb
>> 
>> #PBS -l walltime=744:00:00
>> 
>> #PBS -j oe
>> 
>> #PBS -q batch
>> 
>> 
>> 
>> torque version: tried 3.0.2. In v2.5.4, I think the procs option
>> worked as documented.
>> 
>> maui version: 3.2.6p21 (also tried maui 3.3.1, but it is a complete
>> fail in terms of the procs option and it only asks for a single CPU)
>> 
>> 
>> 
>> ==========================================
>> 
>> 
>> 
>> If there are fewer than 60 processors available in the cluster (in
>> this case there were 53 available), the job will go in and take
>> whatever is left instead of waiting for all 60 processors to free up.
>> Any thoughts as to why this might be happening? Sometimes it doesn't
>> really matter, and 53 would be almost as good as 60; however, if only
>> 2 processors are available and the user asks for 60, I would hate for
>> his job to go in.
>> 
>> 
>> 
>> Thank you for your time!
>> 
>> 
>> 
>> -Lance
>> 
>> 
>> 
>> 
> 

 ----------------------
 Steve Crusan
 System Administrator
 Center for Research Computing
 University of Rochester
 https://www.crc.rochester.edu/


-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/
bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28
cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ
tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8
JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv
Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c=
=AGW7
-----END PGP SIGNATURE-----


------------------------------

Message: 4
Date: Fri, 18 Nov 2011 11:12:06 -0500
From: Lance Westerhoff <lance at quantumbioinc.com>
Subject: Re: [torqueusers] procs= not working as documented
To: Torque Users Mailing List <torqueusers at supercluster.org>
Message-ID: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE at quantumbioinc.com>
Content-Type: text/plain; charset=us-ascii


Hi Steve-

Here you go. Below are the first few lines of the job script, followed
by the output you requested along with the maui.cfg. If you need
anything further, please let me know.

Thanks for your help!

===============

 + head job.pbs

#!/bin/bash
#PBS -S /bin/bash
#PBS -l procs=100
#PBS -l pmem=700mb
#PBS -l walltime=744:00:00
#PBS -j oe
#PBS -q batch

Report run on Fri Nov 18 10:49:38 EST 2011
+ pbsnodes --version
version: 3.0.2
+ diagnose --version
maui client version 3.2.6p21
+ checkjob 371010


checking job 371010

State: Running
Creds:  user:josh  group:games  class:batch  qos:DEFAULT
WallTime: 00:02:35 of 31:00:00:00
SubmitTime: Fri Nov 18 10:46:33
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Fri Nov 18 10:46:34
Total Tasks: 1

Req[0]  TaskCount: 26  Partition: DEFAULT
Network: [NONE]  Memory >= 700M  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 700M
NodeCount: 10
Allocated Nodes:
[compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
[compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
[compute-0-13:2][compute-0-14:2]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '371010' (-00:02:09 -> 30:23:57:51  Duration: 31:00:00:00)
PE:  26.00  StartPriority:  4716

+ cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"
SERVERHOST            gondor
ADMIN1                maui root
ADMIN3                ALL
RMCFG[base]  TYPE=PBS
AMCFG[bank]  TYPE=NONE
RMPOLLINTERVAL        00:01:00
SERVERPORT            42559
SERVERMODE            NORMAL
LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3
QUEUETIMEWEIGHT       1 
FSPOLICY              DEDICATEDPS
FSDEPTH               7
FSINTERVAL            86400
FSDECAY               0.50
FSWEIGHT              200
FSUSERWEIGHT          1
FSGROUPWEIGHT         1000
FSQOSWEIGHT           1000
FSACCOUNTWEIGHT       1
FSCLASSWEIGHT         1000
USERWEIGHT            4
BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST
NODEALLOCATIONPOLICY  MINRESOURCE
RESERVATIONDEPTH            8
MAXJOBPERUSERPOLICY         OFF
MAXJOBPERUSERCOUNT          8
MAXPROCPERUSERPOLICY        OFF
MAXPROCPERUSERCOUNT         256
MAXPROCSECONDPERUSERPOLICY  OFF
MAXPROCSECONDPERUSERCOUNT   36864000
MAXJOBQUEUEDPERUSERPOLICY   OFF
MAXJOBQUEUEDPERUSERCOUNT    2
JOBNODEMATCHPOLICY          EXACTNODE
NODEACCESSPOLICY            SHARED
JOBMAXOVERRUN 99:00:00:00
DEFERCOUNT 8192
DEFERTIME  0
CLASSCFG[developer] FSTARGET=40.00+
CLASSCFG[lowprio] PRIORITY=-1000
SRCFG[developer] CLASSLIST=developer
SRCFG[developer] ACCESS=dedicated
SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri
SRCFG[developer] STARTTIME=08:00:00
SRCFG[developer] ENDTIME=18:00:00
SRCFG[developer] TIMELIMIT=2:00:00
SRCFG[developer] RESOURCES=PROCS(8)
USERCFG[DEFAULT]      FSTARGET=100.0

===============
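
For what it's worth, the allocation above works out to
7+4+2+3+1+2+1+2+2+2 = 26 tasks across 10 nodes, which matches the
reported TaskCount of 26 even though the script asked for procs=100.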

-Lance


On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
> 
>> The request that is placed is for procs=60. Both Torque and Maui see
>> that there are only 53 processors available, and instead of letting
>> the job sit in the queue and wait for all 60 processors to become
>> available, they go ahead and run the job with what's available. Now,
>> if the user could ask for procs=[50-60], where 50 is the minimum
>> number of processors to provide and 60 is the maximum, that would be a
>> feature. But as it stands, if the user asks for 60 processors and ends
>> up with 2 processors, the job just won't scale properly and he may as
>> well kill it (it shouldn't have run anyway).
> 
> Hi Lance,
> 
> 	Can you post the output of 'checkjob <jobid>' for an incorrectly
> running job? Let's take a look at what Maui thinks the job is asking
> for.
> 	
> 	Might as well add your maui.cfg file also. 
> 
> 	I've found in the past that procs= is troublesome...
> 
>> 
>> I'm actually beginning to think the problem may be related to maui.
>> Perhaps I'll post this same question to the maui list and see what
>> comes back.
>> 
>> This problem is infuriating, though, since without this functionality
>> working as it should, procs=X makes Torque/Maui behave more like a
>> submit-and-run system than a queuing system.
> 
> Agreed. HPC cluster job management should normally be
> set-it-and-forget-it. Anything other than maintenance, break fixes,
> and new features would be ridiculously time consuming.
> 
>> 
>> -Lance
>> 
>> 
>>> 
>>> Message: 3
>>> Date: Thu, 17 Nov 2011 17:29:17 -0800
>>> From: "Brock Palen" <brockp at umich.edu>
>>> Subject: Re: [torqueusers] procs= not working as documented
>>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>>> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
>>> Content-Type: text/plain; charset="utf-8"
>>> 
>>> Does maui only see one cpu or does mpiexec only see one cpu?
>>> 
>>> 
>>> 
>>> Brock Palen
>>> (734)936-1985
>>> brockp at umich.edu
>>> - Sent from my Palm Pre, please excuse typos
>>> On Nov 17, 2011 3:19 PM, Lance Westerhoff
>>> <lance at quantumbioinc.com> wrote:
>>> 
>>> 
>>> 
>>> Hello All-
>>> 
>>> 
>>> 
>>> It appears that when running with the following specs, the procs=
>>> option does not actually work as expected.
>>> 
>>> 
>>> 
>>> ==========================================
>>> 
>>> 
>>> 
>>> #PBS -S /bin/bash
>>> 
>>> #PBS -l procs=60
>>> 
>>> #PBS -l pmem=700mb
>>> 
>>> #PBS -l walltime=744:00:00
>>> 
>>> #PBS -j oe
>>> 
>>> #PBS -q batch
>>> 
>>> 
>>> 
>>> torque version: tried 3.0.2. In v2.5.4, I think the procs option
>>> worked as documented.
>>> 
>>> maui version: 3.2.6p21 (also tried maui 3.3.1, but it is a complete
>>> fail in terms of the procs option and it only asks for a single CPU)
>>> 
>>> 
>>> 
>>> ==========================================
>>> 
>>> 
>>> 
>>> If there are fewer than 60 processors available in the cluster (in
>>> this case there were 53 available), the job will go in and take
>>> whatever is left instead of waiting for all 60 processors to free up.
>>> Any thoughts as to why this might be happening? Sometimes it doesn't
>>> really matter, and 53 would be almost as good as 60; however, if only
>>> 2 processors are available and the user asks for 60, I would hate for
>>> his job to go in.
>>> 
>>> 
>>> 
>>> Thank you for your time!
>>> 
>>> 
>>> 
>>> -Lance
>>> 
>>> 
>>> 
>>> 
>> 
> 
> ----------------------
> Steve Crusan
> System Administrator
> Center for Research Computing
> University of Rochester
> https://www.crc.rochester.edu/
> 
> 



------------------------------

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


End of torqueusers Digest, Vol 88, Issue 16
*******************************************

