[torqueusers] Submitting jobs to use multiprocessors.

Gus Correa gus at ldeo.columbia.edu
Fri Mar 21 09:52:18 MDT 2014


PS - If I remember right (but I may be wrong),
pbs_server and pbs_mom must be run by root,
whereas maui can be run by root or by another user
(typically one called "maui").

What do you have there?

Gus Correa

On 03/21/2014 11:47 AM, Gus Correa wrote:
> Hi Hitesh
>
> See the answers inline, please.
>
> On 03/21/2014 11:00 AM, hitesh chugani wrote:
>> Hi Gus,
>>
>> Sorry for the confusion. I didn't actually use the symbols "<" and ">".
>> I have something like this:
>> node1 np=2
>> node2 np=8
>>
>
> Now it sounds right!
>
>> I did change the numbers to match the number of cores. The issue still
>> shows up.
>>
>> The maui daemon is running and scheduling is also enabled. The output is:
>>
>> #
>> # Create queues and set their attributes.
>> #
>> #
>> # Create and define queue batch
>> #
>> create queue batch
>> set queue batch queue_type = Execution
>> set queue batch resources_default.nodes = 1
>> set queue batch resources_default.walltime = 01:00:00
>> set queue batch enabled = True
>> set queue batch started = True
>> #
>> # Set server attributes.
>> #
>> *set server scheduling = True*
>> set server acl_hosts = lws7
>> set server managers = hchugani at lws7.uncc.edu
>> set server operators = hchugani at lws7.uncc.edu
>> set server default_queue = batch
>> set server log_events = 511
>> set server mail_from = adm
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 300
>> set server job_stat_rate = 45
>> set server poll_jobs = True
>> set server mom_job_sync = True
>> set server keep_completed = 300
>> set server next_job_number = 10
>> set server moab_array_compatible = True
>>
>
> Ah, OK, your email comes in HTML, so now I understand that the
> "*" in the above are coming from your email highlights (bold in HTML, probably).
>
> ***
>
> 1) Torque setup:
>
> Are you running the maui daemon as root or did you create a maui user
> for that, or is it owned by hchugani?
>
> In any case, you need to keep whoever owns Maui as a manager (and
> maybe as an operator also) in the torque/pbs_server setup.
> Say, something like this (use whatever is appropriate for your setup):
>
> $ qmgr
>
> set server managers = root
> set server managers += maui
> set server managers += hchugani
> set server operators = root
> set server operators += maui
> set server operators += hchugani
>
> You can also use the -c option to qmgr to send the commands (in quotes)
> from the shell.
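> For example, something like this (adjust the names to your setup):
>
> $ qmgr -c 'set server managers += maui'
> $ qmgr -c 'set server operators += maui'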
>
> **
>
> 2) Maui setup:
>
> I think it should be reciprocal on the Maui side.
> On $MAUI/maui.cfg, something like this
>
> SERVERHOST            your.pbs_server.machine  #must be your pbs_server
> ...
> ADMIN1                root maui hchugani
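> Note that maui reads maui.cfg only when it starts, so restart the
> daemon after any change, something like:
>
> $ service maui restart   # or stop and start the maui process by hand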
>
> **
>
> 3) maui.cfg restrictions
>
> Are you using the standard/boilerplate maui.cfg that comes with Maui,
> or did you change anything in maui.cfg that may perhaps be blocking
> the jobs (say, limiting the number of processors per user, etc.)?
>
> **
>
> 4) Diagnostic:
>
> Also, submit a job, one of those that stays in the Q state,
> and check its status with tracejob:
>
> $ qsub myjob
> 1234.pbs_server.edu
>
> $ tracejob 1234
>
> That may tell you something about why it is stuck in the Q state.
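> Since you are using Maui, its client commands (if they are in your
> path) may also say why the scheduler is holding the job:
>
> $ checkjob 1234
> $ showq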
>
> **
>
> 5) Check the logs:
>
> You can also check the pbs_server log for hints on what is
> causing trouble (in $TORQUE/server_logs/YYYYMMDD).
>
> Check also the Maui scheduler log for possible hints on why the jobs
> are not running (in $MAUI/logs/maui.log).
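> A quick way to search both, again with the example job number, and
> assuming the default locations (use the date the job was submitted):
>
> $ grep 1234 $TORQUE/server_logs/20140321
> $ grep 1234 $MAUI/logs/maui.log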
>
> **
>
> I hope this helps,
> Gus Correa
>
>> Thanks,
>> Hitesh Chugani.
>>
>>
>>
>>
>> On Thu, Mar 20, 2014 at 6:05 PM, Gus Correa
>> <gus at ldeo.columbia.edu> wrote:
>>
>>      Hi Hitesh
>>
>>      1) Did you actually write the "less than" ("<") and
>>      "greater than" (">") characters in your $TORQUE/server_priv/nodes file?
>>      Or are those "<" and ">" just typos in your email?
>>      Or perhaps you don't want the actual node's names to appear on this
>>      mailing list?
>>
>>        >>     Did you create a $TORQUE/pbs_server/nodes file? *Yes*
>>        >>
>>        >>     What are the contents of that file?
>>        >>     *<node1> np=2
>>        >>     <node2> np=2*
>>        >>
>>
>>
>>      The "<" and ">" shouldn't be there, unless you have very unusual
>>      names for your nodes.
>>      There are also some "*" in the lines above that should not be there,
>>      but you may have added that to the email as a highlight, I don't know.
>>
>>      I expected something like this for the file contents (2 lines only,
>>      no "<" or ">").
>>
>>      node1 np=2
>>      node2 np=8
>>
>>      (You said the nodes have 2 and 8 cores/cpus, so one of them should
>>      have np=2, and the other np=8, unless you don't want to use all
>>      cores.
>>      I am assuming node2 is the one with 8 cores, otherwise
>>      you need to adjust the numbers above accordingly.)
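>>      Remember also that pbs_server reads the nodes file when it
>>      starts, so restart it after any change, something like:
>>
>>      $ qterm -t quick
>>      $ pbs_server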
>>
>>      2) You say Maui is enabled.
>>      So, I assume the maui daemon is running, right?
>>
>>      However, you must also enable scheduling on the Torque/PBS server.
>>      Did you enable that option?
>>      What is the output of this?
>>
>>      qmgr -c 'p s' | grep scheduling
>>
>>      If it says "False", you need to do:
>>
>>      qmgr -c  'set server scheduling = True'
>>
>>      I hope this helps,
>>      Gus Correa
>>
>>      On 03/20/2014 02:49 PM, hitesh chugani wrote:
>>       > Hi Sven,
>>       >
>>       > These are the parameters in the job file
>>       >
>>       > #!/bin/bash
>>       > #PBS -l nodes=2:ppn=2
>>       > #PBS -k o
>>       > #PBS -m abe
>>       > #PBS -N JobName
>>       > #PBS -V
>>       > #PBS -j oe
>>       >
>>       > Thanks,
>>       > Hitesh Chugani.
>>       >
>>       >
>>       >
>>       >
>>       >
>>       >
>>       >
>>       > On Thu, Mar 20, 2014 at 2:45 PM, Sven Schumacher
>>       > <schumacher at tfd.uni-hannover.de> wrote:
>>       >
>>       >     Hello,
>>       >
>>       >     What PBS-specific parameters do you specify for your
>>       >     qsub command or in your job file?
>>       >     I noticed once that specifying "mem=" with the total amount
>>       >     of memory needed by the job results in jobs not starting,
>>       >     because maui can't decide whether it is the memory requirement
>>       >     of the job on one of the nodes or of the whole job together...
>>       >     so please tell us the qsub parameters you used...
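>>       >     (If that is the problem here, one common workaround is to
>>       >     request per-process memory with "pmem" instead, e.g.
>>       >     "#PBS -l nodes=2:ppn=2,pmem=500mb", since pmem is
>>       >     unambiguously per process.)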
>>       >
>>       >     Thanks
>>       >
>>       >     Sven Schumacher
>>       >
>>       >     On 20.03.2014 19:30, hitesh chugani wrote:
>>       >>     Hi Gus,
>>       >>
>>       >>
>>       >>     Did you create a $TORQUE/pbs_server/nodes file? *Yes*
>>       >>
>>       >>     What are the contents of that file?
>>       >>     *<node1> np=2
>>       >>     <node2> np=2*
>>       >>
>>       >>     What is the output of "pbsnodes -a"?
>>       >>     *<node1>
>>       >>          state = free
>>       >>          np = 2
>>       >>          ntype = cluster
>>       >>          status = rectime=1395339913,varattr=,jobs=,state=free,netload=8159659934,gres=,loadave=0.00,ncpus=2,physmem=3848508kb,availmem=15671808kb,totmem=16300340kb,idletime=89,nusers=2,nsessions=22,sessions=2084 2619 2839 2855 2873 2877 2879 2887 2889 2916 2893 2891 3333 6665 3053 8036 25960 21736 22263 23582 26141 30680,uname=Linux lws81 2.6.18-371.4.1.el5 #1 SMP Wed Jan 8 18:42:07 EST 2014 x86_64,opsys=linux
>>       >>          mom_service_port = 15002
>>       >>          mom_manager_port = 15003
>>       >>
>>       >>     <node2>
>>       >>          state = free
>>       >>          np = 2
>>       >>          ntype = cluster
>>       >>          status = rectime=1395339913,varattr=,jobs=,state=free,netload=2817775035,gres=,loadave=0.00,ncpus=8,physmem=16265764kb,availmem=52900464kb,totmem=55259676kb,idletime=187474,nusers=3,nsessions=4,sessions=11923 17547 20030 29392,uname=Linux lws10.uncc.edu 2.6.18-371.4.1.el5 #1 SMP Wed Jan 8 18:42:07 EST 2014 x86_64,opsys=linux
>>       >>          mom_service_port = 15002
>>       >>          mom_manager_port = 15003*
>>       >>
>>       >>
>>       >>     Did you enable scheduling in the pbs_server? *Maui is enabled*
>>       >>
>>       >>
>>       >>     Did you keep the --enable-cpuset configuration option?
>>       >>     *No. I have disabled it*
>>       >>
>>       >>
>>       >>     I am able to run single-processor jobs on one or two nodes
>>       >>     (nodes=1 or 2, ppn=1). But when I try to run multiprocessor
>>       >>     jobs (nodes=2:ppn=2, with the nodes having 2 and 8 ncpus),
>>       >>     the job remains in the queue. I am able to forcefully run
>>       >>     the job via qrun. I am using the Maui scheduler.
>>       >>
>>       >>
>>       >>     Please help.
>>       >>
>>       >>
>>       >>     Thanks,
>>       >>     Hitesh chugani.
>>       >>
>>       >>
>>       >>
>>       >>
>>       >>
>>       >>     On Mon, Mar 17, 2014 at 7:35 PM, Gus Correa
>>       >>     <gus at ldeo.columbia.edu> wrote:
>>       >>
>>       >>         Hi Hitesh
>>       >>
>>       >>         Did you create a $TORQUE/pbs_server/nodes file?
>>       >>         What are the contents of that file?
>>       >>         What is the output of "pbsnodes -a"?
>>       >>
>>       >>         Make sure the nodes file is there.
>>       >>         If not, create it again, and restart pbs_server.
>>       >>
>>       >>         Did you enable scheduling in the pbs_server?
>>       >>
>>       >>         Also:
>>       >>
>>       >>         Did you keep the --enable-cpuset configuration option?
>>       >>         If you did:
>>       >>         Do you have a /dev/cpuset directory on your nodes?
>>       >>         Do you have a type cpuset filesystem mounted on /dev/cpuset
>>       >>         on the nodes?
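>>       >>         (You can check with something like
>>       >>         "mount | grep cpuset" on a node.)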
>>       >>
>>       >>         Check this link:
>>       >>
>>       >>
>>      http://docs.adaptivecomputing.com/torque/Content/topics/3-nodes/linuxCpusetSupport.htm
>>       >>
>>       >>         Still on the topic of cpuset:
>>       >>
>>       >>         Are you perhaps running cgroups on the nodes (the cgconfig
>>       >>         service)?
>>       >>
>>       >>         I hope this helps,
>>       >>         Gus Correa
>>       >>
>>       >>         On 03/17/2014 05:45 PM, hitesh chugani wrote:
>>       >>         > Hello,
>>       >>         >
>>       >>         > I have reconfigured torque to disable NUMA support.
>>       >>         > I am able to run single node single processor jobs
>>       >>         > (nodes=1:ppn=1). But when I try to run multiprocessor
>>       >>         > jobs (nodes=2:ppn=2, with nodes having 2 and 8 ncpus),
>>       >>         > the job remains in the queue. I am able to forcefully
>>       >>         > run the job via qrun. I am using the Maui scheduler.
>>       >>         > Can anyone please tell me what the issue may be? Is it
>>       >>         > something to do with the Maui scheduler? Thanks.
>>       >>         >
>>       >>         > Regards,
>>       >>         > Hitesh Chugani.
>>       >>         >
>>       >>         >
>>       >>         > On Mon, Mar 17, 2014 at 12:40 PM, hitesh chugani
>>       >>         > <hiteshschugani at gmail.com> wrote:
>>       >>         >
>>       >>         >     I tried the nodes=X:ppn=Y option. It still didn't
>>       >>         >     work. I guess it has something to do with the NUMA
>>       >>         >     option being enabled. I am looking into this issue
>>       >>         >     and will let you guys know. Thanks a lot.
>>       >>         >
>>       >>         >
>>       >>         >
>>       >>         >     On Thu, Mar 13, 2014 at 10:22 AM, Ken Nielson
>>       >>         >     <knielson at adaptivecomputing.com> wrote:
>>       >>         >
>>       >>         >         Glen is right. There is a regression with procs.
>>       >>         >
>>       >>         >
>>       >>         >         On Wed, Mar 12, 2014 at 5:29 PM,
>>       >>         >         <glen.beane at gmail.com> wrote:
>>       >>         >
>>       >>         >             I think there is a regression in Torque and
>>       >>         procs only works
>>       >>         >             with Moab now. Try nodes=X:ppn=Y
>>       >>         >
>>       >>         >
>>       >>         >             On Mar 12, 2014, at 6:26 PM, hitesh chugani
>>       >>         >             <hiteshschugani at gmail.com
>>      <mailto:hiteshschugani at gmail.com>
>>       >>         <mailto:hiteshschugani at gmail.com
>>      <mailto:hiteshschugani at gmail.com>>
>>       >>         <mailto:hiteshschugani at gmail.com
>>      <mailto:hiteshschugani at gmail.com>
>>       >>         <mailto:hiteshschugani at gmail.com
>>      <mailto:hiteshschugani at gmail.com>>>>
>>       >>         >             wrote:
>>       >>         >
>>       >>         >>             Hi all,
>>       >>         >>
>>       >>         >>
>>       >>         >>             I am trying to submit a job to use
>>       >>         >>             multiprocessors (I added #PBS -l procs=4
>>       >>         >>             to the job script), but the job remains
>>       >>         >>             queued forever. I am using 2 compute nodes
>>       >>         >>             (ncpus=8 and 2). Any idea why it is not
>>       >>         >>             running? Please help.
>>       >>         >>
>>       >>         >>             I have installed torque using these
>>       >>         >>             configuration options:
>>       >>         >>             *./configure --enable-unixsockets
>>       >>         >>             --enable-cpuset
>>       >>         >>             --enable-geometry-requests
>>       >>         >>             --enable-numa-support*
>>       >>         >>
>>       >>         >>
>>       >>         >>
>>       >>         >>
>>       >>         >>             Thanks,
>>       >>         >>             Hitesh Chugani.
>>       >>         >>             Student Linux specialist
>>       >>         >>             University of North Carolina, Charlotte
>>       >>         >
>>       >>         >
>>       >>         >
>>       >>         >
>>       >>         >
>>       >>         >         --
>>       >>         >         Ken Nielson
>>       >>         >         +1 801.717.3700 office  +1 801.717.3738 fax
>>       >>         >         1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>       >>         >         www.adaptivecomputing.com
>>       >>         >
>>       >
>>       >
>>       >     --
>>       >     Sven Schumacher - Systemadministrator Tel: (0511)762-2753
>>       >     Leibniz Universitaet Hannover
>>       >     Institut für Turbomaschinen und Fluid-Dynamik       - TFD
>>       >     Appelstraße 9 - 30167 Hannover
>>       >     Institut für Kraftwerkstechnik und Wärmeübertragung - IKW
>>       >     Callinstraße 36 - 30167 Hannover