[torqueusers] Basic torque config

Christina Salls christina.salls at noaa.gov
Tue Feb 14 08:54:45 MST 2012


On Tue, Feb 14, 2012 at 10:36 AM, Grigory Shamov <gas5x at yahoo.com> wrote:

> Do you have a scheduler installed? Like, Maui, Moab?
>

No, I don't.  My plan is to run Torque on a single cluster with one head
node and 20 compute nodes.  The user base is currently around 5 and may
increase to 10.  We are simply trying to manage the resource, probably in a
FIFO manner.  I was hoping to get away with the built-in Torque scheduler
because of the simplicity of the config.  Do you think that is possible?
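
In case it helps to see what I mean, the bare-bones setup I'm aiming for is
roughly this (assuming the stock pbs_sched that ships with Torque and the
default install paths, so treat it as a sketch rather than exactly what I
have):

  # on the head node, as root
  pbs_server                               # already running in my case
  pbs_sched                                # the built-in FIFO scheduler
  qmgr -c "set server scheduling = true"   # let the server use it

  # quick smoke test from a regular account
  echo "date" | qsub
  qstat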

>
>
> --- On Tue, 2/14/12, Christina Salls <christina.salls at noaa.gov> wrote:
>
>
> From: Christina Salls <christina.salls at noaa.gov>
> Subject: [torqueusers] Basic torque config
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>, "Brian
> Beagan" <beagan at sgi.com>, "John Cardenas" <cardenas at sgi.com>, "Jeff
> Hanson" <jhanson at sgi.com>, "Michael Saxon" <saxonm at sgi.com>, "help >>
> GLERL IT Help" <oar.glerl.it-help at noaa.gov>, keenandr at msu.edu
> Date: Tuesday, February 14, 2012, 6:36 AM
>
>
> Hi all,
>
>       I finally made some progress but am not all the way there yet.  I
> changed the hostname of the server to admin, which is the hostname assigned
> to the interface that the compute nodes are physically connected to.  Now
> my pbsnodes command shows the nodes as free!!
>
> [root at wings torque]# pbsnodes -a
> n001.default.domain
>      state = free
>      np = 1
>      ntype = cluster
>      status = rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
>      gpus = 0
>
> n002.default.domain
>      state = free
>      np = 1
>      ntype = cluster
>      status = rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
>      gpus = 0
>
> ...and so on, for all 20 nodes.
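>
> For reference, the server name and the nodes list are set up roughly like
> this (default paths from my install; this is a sketch from memory rather
> than a paste):
>
> # /var/spool/torque/server_name   (head node and each compute node)
> admin
>
> # /var/spool/torque/server_priv/nodes   (head node only)
> n001 np=1
> n002 np=1
> ...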
>
> And now when I submit a job, I get a job id back; however, the jobs stay
> in the queued state.
>
> -bash-4.1$ ./example_submit_script_1
> Fri Feb 10 15:46:35 CST 2012
> Fri Feb 10 15:46:45 CST 2012
> -bash-4.1$ ./example_submit_script_1 | qsub
> 6.admin.default.domain
> -bash-4.1$ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 4.wings                    STDIN            salls                  0 Q batch
> 5.wings                    STDIN            salls                  0 Q batch
> 6.admin                    STDIN            salls                  0 Q batch
>
> I deleted the two jobs that were created when wings was the server, in
> case they were getting in the way.
>
> [root at wings torque]# qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 6.admin                    STDIN            salls                  0 Q batch
> [root at wings torque]# qstat -a
>
> admin.default.domain:
>
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 6.admin.default.     salls    batch    STDIN               --    --   --    --    --  Q   --
> [root at wings torque]#
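>
> A couple of things I still want to try, to see why the job never leaves
> the queued state (using job id 6 here just as an example):
>
> qstat -f 6            # full job attributes (including a comment, if the scheduler set one)
> tracejob 6            # pulls the job's history out of the server/mom logs
> momctl -d 3 -h n001   # ask one of the moms for its diagnostics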
>
>
> I don't see anything that seems significant in the logs:
>
> Lots of entries like this in the server log:
> 02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
> 02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
> 02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
>
> This is the entirety of the sched_log:
>
> 02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
> 02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120210 opened
> 02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid 12576
> 02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
> 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
> 02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address already in use (98) in main, bind
> 02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
> 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed
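>
> (The "Address already in use" line looks like it is from when a second
> pbs_sched was started while the first one, pid 12576 below, was still
> running.  If the scheduler ever needs a clean restart, I assume it is just
> something like:
>
> ps -ef | grep pbs_sched   # find the running scheduler
> kill 12576                # stop it (that pid in my case)
> pbs_sched                 # start a fresh one
>
> but I have not needed to do that yet.)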
>
> The mom logs on the compute nodes all show the same repeated entries:
>
> 02/14/2012 08:03:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/14/2012 08:08:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/14/2012 08:13:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/14/2012 08:18:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/14/2012 08:23:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
>
> ps looks like this:
>
> -bash-4.1$ ps -ef | grep pbs
> root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
> salls    12727 26862  0 08:19 pts/0    00:00:00 grep pbs
> root     25810     1  0 Feb10 ?        00:00:25 pbs_server -H admin.default.domain
>
> The server and queue settings are as follows:
>
> Qmgr: list server
> Server admin.default.domain
>     server_state = Active
>     scheduling = True
>     total_jobs = 1
>     state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
>     acl_hosts = admin.default.domain,wings.glerl.noaa.gov
>     default_queue = batch
>     log_events = 511
>     mail_from = adm
>     scheduler_iteration = 600
>     node_check_rate = 150
>     tcp_timeout = 6
>     mom_job_sync = True
>     pbs_version = 2.5.9
>     keep_completed = 300
>     next_job_number = 7
>     net_counter = 1 0 0
>
> Qmgr: list queue batch
> Queue batch
>     queue_type = Execution
>     Priority = 100
>     total_jobs = 1
>     state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
>     max_running = 300
>     mtime = Thu Feb  9 18:22:33 2012
>     enabled = True
>     started = True
>
> Do I need to create a routing queue?  It seems like I am missing a basic
> element here.
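>
> From what I remember of the Torque quick-start, a routing queue should not
> be required for a single execution queue.  The one thing my queue seems to
> be missing compared to the basic examples is the resources_default
> settings, so my guess at the missing piece (untested) is something like:
>
> qmgr -c "set queue batch resources_default.nodes = 1"
> qmgr -c "set queue batch resources_default.walltime = 1:00:00"
>
> If a routing queue did turn out to be necessary, I gather it would just be
> a second queue with queue_type = Route and route_destinations = batch,
> with the server default_queue pointed at it.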
>
> Thanks in advance,
>
> Christina
>
>
>
> --
> Christina A. Salls
> GLERL Computer Group
> help.glerl at noaa.gov
> Help Desk x2127
> Christina.Salls at noaa.gov
> Voice Mail 734-741-2446
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>


-- 
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446

