[torqueusers] Basic torque config

Christina Salls christina.salls at noaa.gov
Tue Feb 14 11:28:32 MST 2012


On Tue, Feb 14, 2012 at 1:24 PM, Gustavo Correa <gus at ldeo.columbia.edu> wrote:

> Make sure pbs_sched [or, alternatively, Maui, if you installed it] is
> running.
>

Thanks for the response.

It appears to be running.

[root at wings etc]# ps -ef | grep pbs
root      6896  6509  0 12:25 pts/24   00:00:00 grep pbs
root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
root     25810     1  0 Feb10 ?        00:00:26 pbs_server -H
admin.default.domain


>
> Also, as root, on the pbs_server computer, enable scheduling:
> qmgr -c 'set server scheduling=True'
>

And it appears that server scheduling is already set to True:

[root at wings etc]# qmgr
Max open servers: 10239
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch Priority = 100
set queue batch max_running = 300
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = admin.default.domain
set server acl_hosts += wings.glerl.noaa.gov
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 8

By the way, what is the best way to get both the server and the scheduler to
start automatically at boot time?
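(One approach I'm considering, assuming the example init scripts shipped in the TORQUE 2.5 source tree under contrib/init.d/ work on a RHEL/CentOS-style system; the paths and script names below are assumptions and may differ per build:)

```shell
# Sketch, assuming the commands are run as root from the TORQUE 2.5
# source tree on a chkconfig-based (RHEL/CentOS-style) system.
cp contrib/init.d/pbs_server /etc/init.d/
cp contrib/init.d/pbs_sched  /etc/init.d/
chmod 755 /etc/init.d/pbs_server /etc/init.d/pbs_sched

# Register the services and enable them in the default runlevels.
chkconfig --add pbs_server
chkconfig --add pbs_sched
chkconfig pbs_server on
chkconfig pbs_sched on
```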

>
> I hope this helps,
> Gus Correa
>
> On Feb 14, 2012, at 10:36 AM, Grigory Shamov wrote:
>
> > Do you have a scheduler installed? Like, Maui, Moab?
> >
> >
> >
> >
> > --- On Tue, 2/14/12, Christina Salls <christina.salls at noaa.gov> wrote:
> >
> > From: Christina Salls <christina.salls at noaa.gov>
> > Subject: [torqueusers] Basic torque config
> > To: "Torque Users Mailing List" <torqueusers at supercluster.org>, "Brian
> Beagan" <beagan at sgi.com>, "John Cardenas" <cardenas at sgi.com>, "Jeff
> Hanson" <jhanson at sgi.com>, "Michael Saxon" <saxonm at sgi.com>, "help >>
> GLERL IT Help" <oar.glerl.it-help at noaa.gov>, keenandr at msu.edu
> > Date: Tuesday, February 14, 2012, 6:36 AM
> >
> > Hi all,
> >
> >       I finally made some progress but am not all the way there yet.  I
> changed the hostname of the server to admin, which is the hostname assigned
> to the interface that the compute nodes are physically connected to.  Now
> my pbsnodes command shows the nodes as free!!
> >
> > [root at wings torque]# pbsnodes -a
> > n001.default.domain
> >      state = free
> >      np = 1
> >      ntype = cluster
> >      status =
> rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=?
> 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May
> 10 15:42:40 EDT 2011 x86_64,opsys=linux
> >      gpus = 0
> >
> > n002.default.domain
> >      state = free
> >      np = 1
> >      ntype = cluster
> >      status =
> rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=?
> 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May
> 10 15:42:40 EDT 2011 x86_64,opsys=linux
> >      gpus = 0
> >
> > ....For all 20 nodes.
> >
> > And now when I submit a job, I get a job id back; however, the job
> stays in the queued state.
> >
> > -bash-4.1$ ./example_submit_script_1
> > Fri Feb 10 15:46:35 CST 2012
> > Fri Feb 10 15:46:45 CST 2012
> > -bash-4.1$ ./example_submit_script_1 | qsub
> > 6.admin.default.domain
> > -bash-4.1$ qstat
> > Job id                    Name             User            Time Use S
> Queue
> > ------------------------- ---------------- --------------- -------- -
> -----
> > 4.wings                    STDIN            salls                  0 Q
> batch
> > 5.wings                    STDIN            salls                  0 Q
> batch
> > 6.admin                    STDIN            salls                  0 Q
> batch
> >
> > I deleted the two jobs that were created when wings was the server in
> case they were getting in the way
> >
> > [root at wings torque]# qstat
> > Job id                    Name             User            Time Use S
> Queue
> > ------------------------- ---------------- --------------- -------- -
> -----
> > 6.admin                    STDIN            salls                  0 Q
> batch
> > [root at wings torque]# qstat -a
> >
> > admin.default.domain:
> >
>  Req'd  Req'd   Elap
> > Job ID               Username Queue    Jobname          SessID NDS   TSK
> Memory Time  S Time
> > -------------------- -------- -------- ---------------- ------ ----- ---
> ------ ----- - -----
> > 6.admin.default.     salls    batch    STDIN               --    --   --
>    --    --  Q   --
> > [root at wings torque]#
> >
> >
> > I don't see anything that seems significant in the logs:
> >
> > Lots of entries like this in the server log:
> > 02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
> = 2.5.9, loglevel = 0
> > 02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
> = 2.5.9, loglevel = 0
> > 02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version
> = 2.5.9, loglevel = 0
> >
> > This is the entirety of the sched_log:
> >
> > 02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
> > 02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file
> /var/spool/torque/sched_priv/accounting/20120210 opened
> > 02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid 12576
> > 02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
> > 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
> > 02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address
> already in use (98) in main, bind
> > 02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
> > 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed
> >
> > mom logs on the compute nodes have the same multiple entries:
> >
> > 02/14/2012 08:03:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > 02/14/2012 08:08:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > 02/14/2012 08:13:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > 02/14/2012 08:18:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > 02/14/2012 08:23:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> >
> > ps looks like this:
> >
> > -bash-4.1$ ps -ef | grep pbs
> > root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
> > salls    12727 26862  0 08:19 pts/0    00:00:00 grep pbs
> > root     25810     1  0 Feb10 ?        00:00:25 pbs_server -H
> admin.default.domain
> >
> > The server and queue settings are as follows:
> >
> > Qmgr: list server
> > Server admin.default.domain
> >       server_state = Active
> >       scheduling = True
> >       total_jobs = 1
> >       state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0
> Exiting:0
> >       acl_hosts = admin.default.domain,wings.glerl.noaa.gov
> >       default_queue = batch
> >       log_events = 511
> >       mail_from = adm
> >       scheduler_iteration = 600
> >       node_check_rate = 150
> >       tcp_timeout = 6
> >       mom_job_sync = True
> >       pbs_version = 2.5.9
> >       keep_completed = 300
> >       next_job_number = 7
> >       net_counter = 1 0 0
> >
> > Qmgr: list queue batch
> > Queue batch
> >       queue_type = Execution
> >       Priority = 100
> >       total_jobs = 1
> >       state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0
> Exiting:0
> >       max_running = 300
> >       mtime = Thu Feb  9 18:22:33 2012
> >       enabled = True
> >       started = True
> >
> > Do I need to create a routing queue?  It seems like I am missing a basic
> element here.
> >
> > Thanks in advance,
> >
> > Christina
> >
> >
> >
> > --
> > Christina A. Salls
> > GLERL Computer Group
> > help.glerl at noaa.gov
> > Help Desk x2127
> > Christina.Salls at noaa.gov
> > Voice Mail 734-741-2446
> >
> >
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
>



-- 
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446

