[torqueusers] Submitting jobs to use multiprocessors.

Gus Correa gus at ldeo.columbia.edu
Fri Mar 21 09:47:57 MDT 2014


Hi Hitesh

See the answers inline, please.

On 03/21/2014 11:00 AM, hitesh chugani wrote:
> Hi Gus,
>
> Sorry to confuse you. I didn't actually use the symbols "<" and ">".
> I have something like this:
> node1 np=2
> node2 np=8
>

Now it sounds right!

> I did change the numbers to match the number of cores. The issue still
> shows up.
>
> The maui daemon and scheduling are also enabled. The output is:
>
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> *set server scheduling = True*
> set server acl_hosts = lws7
> set server managers = hchugani at lws7.uncc.edu
> set server operators = hchugani at lws7.uncc.edu
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 300
> set server job_stat_rate = 45
> set server poll_jobs = True
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 10
> set server moab_array_compatible = True
>

Ah, OK, your email comes in HTML, so now I understand: in the above, the
"*" are coming from your email highlights (bold in HTML, probably).

***

1) Torque setup:

Are you running the maui daemon as root or did you create a maui user 
for that, or is it owned by hchugani?

In any case, you need to keep whoever owns Maui as a manager (maybe as
an operator also) in the torque/pbs_server setup.
Say, something like this (use whatever is appropriate for your setup):

$ qmgr

set server managers = root
set server managers += maui
set server managers += hchugani
set server operators = root
set server operators += maui
set server operators += hchugani

You can also use the -c option to qmgr to send the commands (in quotes) 
from the shell.
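
For example, a minimal sketch (adjust the user names to your setup):

$ qmgr -c 'set server managers += maui'
$ qmgr -c 'set server operators += maui'
$ qmgr -c 'p s'    # prints the server settings, to verify the change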

**

2) Maui setup:

I think the setup should be reciprocal on the Maui side.
In $MAUI/maui.cfg, something like this:

SERVERHOST            your.pbs_server.machine  #must be your pbs_server
...
ADMIN1                root maui hchugani
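
Remember that Maui reads maui.cfg only at startup, so restart the maui
daemon after any change.  Something like this, as root (a sketch; the
path and init script name depend on how you installed Maui):

# service maui restart
(or kill the maui process and start /usr/local/maui/sbin/maui again)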

**

3) maui.cfg restrictions

Are you using the standard/boilerplate maui.cfg that comes with Maui,
or did you change anything in maui.cfg that may perhaps be blocking
the jobs (say, limiting the number of processors per user, etc.)?
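
You can dump the parameters Maui actually loaded with the showconfig
command and look for throttling policies there, e.g. (a sketch):

$ showconfig | grep -i max
$ showconfig | grep -i policy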

**

4) Diagnostic:

Also, submit a job, one of those that stays in Q state,
and check its status with tracejob:

$ qsub myjob
1234.pbs_server.edu

$ tracejob 1234

That may tell you something about why it is in Q state.
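
Maui's own diagnostics usually say more than tracejob about jobs
stuck in Q state:

$ checkjob 1234
$ showq

checkjob normally prints the reason why the scheduler is deferring
or rejecting the job.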

**

5) Check the logs:

You can also check the pbs_server log for hints on what is
causing trouble (in $TORQUE/server_logs/YYYYMMDD).

Check also the Maui scheduler log for possible hints on why the jobs
are not running (in $MAUI/logs/maui.log).
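
For instance, something like this (adjust the date and job number):

$ grep 1234 $TORQUE/server_logs/20140321
$ grep 1234 $MAUI/logs/maui.log | tail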

**

I hope this helps,
Gus Correa

> Thanks,
> Hitesh Chugani.
>
>
>
>
> On Thu, Mar 20, 2014 at 6:05 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>     Hi Hitesh
>
>     1) Did you actually write the "less than" ("<") and
>     "greater than" (">") characters in your $TORQUE/server_priv/nodes file?
>     Or are those "<" and ">" just typos in your email?
>     Or perhaps you don't want the actual node's names to appear on this
>     mailing list?
>
>       >>     Did you create a $TORQUE/pbs_server/nodes file? *Yes*
>       >>
>       >>     What are the contents of that file?
>       >>     *<node1> np=2
>       >>     <node2> np=2*
>       >>
>
>
>     The "<" and ">" shouldn't be there, unless you have very unusual
>     names for your nodes.
>     There are also some "*" in the lines above that should not be there,
>     but you may have added that to the email as a highlight, I don't know.
>
>     I expected something like this for the file contents (2 lines only,
>     no "<" or ">").
>
>     node1 np=2
>     node2 np=8
>
>     (You said the nodes have 2 and 8 cores/cpus, so one of them should
>     have np=2, and the other np=8, unless you don't want to use all
>     cores.
>     I am assuming node2 is the one with 8 cores, otherwise
>     you need to adjust the numbers above accordingly.)
>
>     2) You say Maui is enabled.
>     So, I assume the maui daemon is running, right?
>
>     However, you must also enable scheduling on the Torque/PBS server.
>     Did you enable that option?
>     What is the output of this?
>
>     qmgr -c 'p s' | grep scheduling
>
>     If it says "False", you need to do:
>
>     qmgr -c  'set server scheduling = True'
>
>     I hope this helps,
>     Gus Correa
>
>     On 03/20/2014 02:49 PM, hitesh chugani wrote:
>      > Hi Sven,
>      >
>      > These are the parameters in the job file
>      >
>      > #!/bin/bash
>      > #PBS -l nodes=2:ppn=2
>      > #PBS -k o
>      > #PBS -m abe
>      > #PBS -N JobName
>      > #PBS -V
>      > #PBS -j oe
>      >
>      > Thanks,
>      > Hitesh Chugani.
>      >
>      >
>      >
>      >
>      >
>      >
>      >
>      > On Thu, Mar 20, 2014 at 2:45 PM, Sven Schumacher
>      > <schumacher at tfd.uni-hannover.de> wrote:
>      >
>      >     Hello,
>      >
>      >     what PBS-specific parameters do you specify for your
>     qsub-command or
>      >     in your job-file?
>      >     I noticed once that specifying "mem=" with the total amount of
>      >     memory needed by the job results in jobs not starting, because
>      >     Maui can't decide whether it is the memory requirement of the job
>      >     on one of the nodes or of all of them together... so please tell
>      >     us the qsub parameters you used...
>      >
>      >     Thanks
>      >
>      >     Sven Schumacher
>      >
>      >     Am 20.03.2014 19:30, schrieb hitesh chugani:
>      >>     Hi Gus,
>      >>
>      >>
>      >>     Did you create a $TORQUE/pbs_server/nodes file? *Yes*
>      >>
>      >>     What are the contents of that file?
>      >>     *<node1> np=2
>      >>     <node2> np=2*
>      >>
>      >>     What is the output of "pbsnodes -a"?
>      >>     *<node1>
>      >>     *
>      >>     *     state = free
>      >>          np = 2
>      >>          ntype = cluster
>      >>          status =
>      >>          rectime=1395339913,varattr=,jobs=,state=free,netload=8159659934,gres=,loadave=0.00,ncpus=2,physmem=3848508kb,availmem=15671808kb,totmem=16300340kb,idletime=89,nusers=2,nsessions=22,sessions=2084
>      >>     2619 2839 2855 2873 2877 2879 2887 2889 2916 2893 2891 3333 6665
>      >>     3053 8036 25960 21736 22263 23582 26141 30680,uname=Linux lws81
>      >>     2.6.18-371.4.1.el5 #1 SMP Wed Jan 8 18:42:07 EST 2014
>      >>     x86_64,opsys=linux
>      >>          mom_service_port = 15002
>      >>          mom_manager_port = 15003
>      >>
>      >>     *
>      >>     *<node2>
>      >>     *
>      >>     *     state = free
>      >>          np = 2
>      >>          ntype = cluster
>      >>          status =
>      >>          rectime=1395339913,varattr=,jobs=,state=free,netload=2817775035,gres=,loadave=0.00,ncpus=8,physmem=16265764kb,availmem=52900464kb,totmem=55259676kb,idletime=187474,nusers=3,nsessions=4,sessions=11923
>      >>     17547 20030 29392,uname=Linux lws10.uncc.edu 2.6.18-371.4.1.el5 #1 SMP Wed Jan 8 18:42:07 EST 2014 x86_64,opsys=linux
>      >>          mom_service_port = 15002
>      >>          mom_manager_port = 15003*
>      >>
>      >>
>      >>     Did you enable scheduling in the pbs_server? *Maui is enabled*
>      >>
>      >>
>      >>     Did you keep the --enable-cpuset configuration option? *No.
>      >>     I have disabled it*
>      >>
>      >>
>      >>     I am able to run single/two node single processor
>      >>     jobs (nodes=1(and 2):ppn=1). But when I am trying to run
>      >>     multiprocessor jobs (nodes=2:ppn=2 with nodes having 2 and 8
>      >>     ncpus), the job remains in the queue. I am able to forcefully
>      >>     run the job via qrun. I am using the Maui scheduler.
>      >>
>      >>
>      >>     Please help.
>      >>
>      >>
>      >>     Thanks,
>      >>     Hitesh chugani.
>      >>
>      >>
>      >>
>      >>
>      >>
>      >>     On Mon, Mar 17, 2014 at 7:35 PM, Gus Correa
>      >>     <gus at ldeo.columbia.edu> wrote:
>      >>
>      >>         Hi Hitesh
>      >>
>      >>         Did you create a $TORQUE/pbs_server/nodes file?
>      >>         What are the contents of that file?
>      >>         What is the output of "pbsnodes -a"?
>      >>
>      >>         Make sure the nodes file is there.
>      >>         If not, create it again, and restart pbs_server.
>      >>
>      >>         Did you enable scheduling in the pbs_server?
>      >>
>      >>         Also:
>      >>
>      >>         Did you keep the --enable-cpuset configuration option?
>      >>         If you did:
>      >>         Do you have a /dev/cpuset directory on your nodes?
>      >>         Do you have a type cpuset filesystem mounted on /dev/cpuset
>      >>         on the nodes?
>      >>
>      >>         Check this link:
>      >>
>      >>
>     http://docs.adaptivecomputing.com/torque/Content/topics/3-nodes/linuxCpusetSupport.htm
>      >>
>      >>         Still in the topic of cpuset:
>      >>
>      >>         Are you perhaps running cgroups on the nodes (the cgconfig
>      >>         service)?
>      >>
>      >>         I hope this helps,
>      >>         Gus Correa
>      >>
>      >>         On 03/17/2014 05:45 PM, hitesh chugani wrote:
>      >>         > Hello,
>      >>         >
>      >>         > I have reconfigured torque to disable NUMA support. I am
>      >>         > able to run single node single processor jobs
>      >>         > (nodes=1:ppn=1). But when I am trying to run multiprocessor
>      >>         > jobs (nodes=2:ppn=2 with nodes having 2 and 8 ncpus), the
>      >>         > job remains in the queue. I am able to forcefully run the
>      >>         > job via qrun. I am using the Maui scheduler. Can anyone
>      >>         > please tell me what may be the issue? Is it something to do
>      >>         > with the Maui scheduler? Thanks.
>      >>         >
>      >>         > Regards,
>      >>         > Hitesh Chugani.
>      >>         >
>      >>         >
>      >>         > On Mon, Mar 17, 2014 at 12:40 PM, hitesh chugani
>      >>         > <hiteshschugani at gmail.com> wrote:
>      >>         >
>      >>         >     I tried the nodes=X:ppn=Y option. It still didn't
>      >>         >     work. I guess it has something to do with enabling the
>      >>         >     NUMA option. I am looking into this issue and will let
>      >>         >     you guys know. Thanks a lot.
>      >>         >
>      >>         >
>      >>         >
>      >>         >     On Thu, Mar 13, 2014 at 10:22 AM, Ken Nielson
>      >>         >     <knielson at adaptivecomputing.com> wrote:
>      >>         >
>      >>         >         Glen is right. There is a regression with procs.
>      >>         >
>      >>         >
>      >>         >         On Wed, Mar 12, 2014 at 5:29 PM,
>      >>         >         <glen.beane at gmail.com> wrote:
>      >>         >
>      >>         >             I think there is a regression in Torque and
>      >>         >             procs only works with Moab now. Try nodes=X:ppn=Y
>      >>         >
>      >>         >
>      >>         >             On Mar 12, 2014, at 6:26 PM, hitesh chugani
>      >>         >             <hiteshschugani at gmail.com> wrote:
>      >>         >
>      >>         >>             Hi all,
>      >>         >>
>      >>         >>
>      >>         >>             I am trying to submit a job to use
>      >>         >>             multiprocessors (added #PBS -l procs=4 in the
>      >>         >>             job script) but the job remains queued forever.
>      >>         >>             I am using 2 compute nodes (ncpus=8 and 2).
>      >>         >>             Any idea why it is not running? Please help.
>      >>         >>
>      >>         >>             I have installed torque using this
>      >>         configuration option.
>      >>         >>             *./configure --enable-unixsockets
>      >>         >>             --enable-cpuset --enable-geometry-requests
>      >>         >>             --enable-numa-support*
>      >>         >>
>      >>         >>
>      >>         >>
>      >>         >>
>      >>         >>             Thanks,
>      >>         >>             Hitesh Chugani.
>      >>         >>             Student Linux specialist
>      >>         >>             University of North Carolina, Charlotte
>      >>         >
>      >>         >
>      >>         >
>      >>         >
>      >>         >
>      >>         >         --
>      >>         >         Ken Nielson
>      >>         >         +1 801.717.3700 office, +1 801.717.3738 fax
>      >>         >         1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>      >>         >         www.adaptivecomputing.com
>      >>         >
>      >>         >
>      >>         >
>      >>         >
>      >>         >
>      >>         >
>      >>         >
>      >>         >
>      >>
>      >>
>      >>
>      >>
>      >>
>      >
>      >
>      >     --
>      >     Sven Schumacher - Systemadministrator Tel: (0511)762-2753
>      >     Leibniz Universitaet Hannover
>      >     Institut für Turbomaschinen und Fluid-Dynamik       - TFD
>      >     Appelstraße 9 - 30167 Hannover
>      >     Institut für Kraftwerkstechnik und Wärmeübertragung - IKW
>      >     Callinstraße 36 - 30167 Hannover
>      >
>      >
>      >
>      >
>      >
>      >
>      >
>
>
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>


