[torqueusers] Basic torque config

Christina Salls christina.salls at noaa.gov
Wed Feb 15 08:02:19 MST 2012


On Tue, Feb 14, 2012 at 1:43 PM, Gustavo Correa <gus at ldeo.columbia.edu> wrote:

>
> On Feb 14, 2012, at 1:28 PM, Christina Salls wrote:
>
> >
> >
> > On Tue, Feb 14, 2012 at 1:24 PM, Gustavo Correa <gus at ldeo.columbia.edu>
> wrote:
> > Make sure pbs_sched [or alternatively Maui, if you installed it] is
> > running.
> >
> > Thanks for the response.
> >
> > It appears to be running.
> >
> > [root at wings etc]# ps -ef | grep pbs
> > root      6896  6509  0 12:25 pts/24   00:00:00 grep pbs
> > root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
> > root     25810     1  0 Feb10 ?        00:00:26 pbs_server -H
> admin.default.domain
> >
> >
> > Also, as root, on the pbs_server computer, enable scheduling:
> > qmgr -c 'set server scheduling=True'
> >
> > And it appears that server scheduling is already set to True:
> >
> > [root at wings etc]# qmgr
> > Max open servers: 10239
> > Qmgr: print server
> > #
> > # Create queues and set their attributes.
> > #
> > #
> > # Create and define queue batch
> > #
> > create queue batch
> > set queue batch queue_type = Execution
> > set queue batch Priority = 100
> > set queue batch max_running = 300
> > set queue batch enabled = True
> > set queue batch started = True
> > #
> > # Set server attributes.
> > #
> > set server scheduling = True
> > set server acl_hosts = admin.default.domain
> > set server acl_hosts += wings.glerl.noaa.gov
> > set server default_queue = batch
> > set server log_events = 511
> > set server mail_from = adm
> > set server scheduler_iteration = 600
> > set server node_check_rate = 150
> > set server tcp_timeout = 6
> > set server mom_job_sync = True
> > set server keep_completed = 300
> > set server next_job_number = 8
> >
>
> If you made changes in the nodes file, etc., restart the daemons, just in
> case:
> service pbs_server restart
> service pbs_sched restart
> service pbs_mom restart [this one on the compute nodes]
>
I restarted the whole cluster after I put the scripts in /etc/init.d, to
make sure everything came back up.
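
After a restart like that, something along these lines is a quick way to
confirm the daemons really came back (the same kind of ps check as above,
plus pbsnodes; nothing here is site-specific):

ps -ef | egrep 'pbs_(server|sched)' | grep -v grep   # on the head node
ps -ef | grep pbs_mom | grep -v grep                 # on a compute node
pbsnodes -a | grep 'state ='                         # every node should report state = free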


> Then check the pbs_server logs [$TORQUE/server_logs]
>

This is what the server log looks like when I submit a job:

02/14/2012 15:11:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version =
2.5.9, loglevel = 0
02/14/2012 15:16:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version =
2.5.9, loglevel = 0
02/14/2012 15:18:39;0100;PBS_Server;Job;8.admin.default.domain;enqueuing
into batch, state 1 hop 1
02/14/2012 15:18:39;0008;PBS_Server;Job;8.admin.default.domain;Job Queued
at request of salls at admin.default.domain, owner = salls at admi
n.default.domain, job name = STDIN, queue = batch
02/14/2012 15:21:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version =
2.5.9, loglevel = 0
02/14/2012 15:26:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version =
2.5.9, loglevel = 0
02/14/2012 15:31:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version =
2.5.9, loglevel = 0
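
The only entries around the submit time are the enqueue lines above.  If more
detail would help, the server's log verbosity can be turned up before
resubmitting the test job.  A sketch, assuming this Torque release exposes the
log_level server attribute through qmgr (the attribute name is my assumption;
the current value shows up as "loglevel = 0" in the log lines above):

qmgr -c 'set server log_level = 7'   # more verbose server/scheduler logging
# resubmit the test job, then watch $TORQUE/server_logs for new detail
qmgr -c 'set server log_level = 0'   # restore the default verbosity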

> and the system logs on the computer where pbs_server runs
> [/var/log/messages].
>

Good idea.  This is what was happening about the time I submitted the job:

Feb 14 15:10:00 n004 smartd[4177]: smartd has fork()ed into background
mode. New PID=4177.
Feb 14 15:17:51 wings xinetd[3137]: EXIT: tftp status=0 pid=5898
duration=903(sec)
Feb 14 15:26:48 wings avahi-daemon[2566]: Invalid query packet.
Feb 14 15:26:48 wings avahi-daemon[2566]: Invalid query packet.
Feb 14 15:26:48 wings avahi-daemon[2566]: Invalid query packet.


> There may be messages in either one with hints about the actual problem.
>
> > By the way, what is the best way to get both the server and scheduler to
> > start at run time?
> >
>
> It depends on your OS and Linux distribution.
> Normally you put the pbs_sched and pbs_server scripts in /etc/init.d
> [they come in the Torque 'contrib' directory, I think, but if you
> installed from RPMs or other packages they may already be there].
> On the compute nodes you put pbs_mom there.
> If your pbs_server computer will also be used as a compute node, add
> pbs_mom there too.
> Then schedule them to start at init/boot time with chkconfig [which the
> Fedora folks have now largely replaced with systemd's systemctl, in case
> you use Fedora].
>

Thanks!  I found the scripts and copied them to /etc/init.d and used
chkconfig to turn them on.  I am running RHEL 6.2.
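
For the archives, the sequence on RHEL 6 was roughly the following (a sketch;
the contrib/init.d path is how the scripts ship in the Torque source tree, so
it may differ for packaged installs):

# on the head node, from the Torque source tree
cp contrib/init.d/pbs_server contrib/init.d/pbs_sched /etc/init.d/
chmod 755 /etc/init.d/pbs_server /etc/init.d/pbs_sched
chkconfig --add pbs_server
chkconfig --add pbs_sched
chkconfig pbs_server on
chkconfig pbs_sched on

# on each compute node
cp contrib/init.d/pbs_mom /etc/init.d/
chkconfig --add pbs_mom
chkconfig pbs_mom on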


>
> I hope it helps,
> Gus Correa
>
>
> > I hope this helps,
> > Gus Correa
> >
> > On Feb 14, 2012, at 10:36 AM, Grigory Shamov wrote:
> >
> > > Do you have a scheduler installed? Like, Maui, Moab?
> > >
> > >
> > >
> > >
> > > --- On Tue, 2/14/12, Christina Salls <christina.salls at noaa.gov> wrote:
> > >
> > > From: Christina Salls <christina.salls at noaa.gov>
> > > Subject: [torqueusers] Basic torque config
> > > To: "Torque Users Mailing List" <torqueusers at supercluster.org>,
> "Brian Beagan" <beagan at sgi.com>, "John Cardenas" <cardenas at sgi.com>,
> "Jeff Hanson" <jhanson at sgi.com>, "Michael Saxon" <saxonm at sgi.com>, "help
> >> GLERL IT Help" <oar.glerl.it-help at noaa.gov>, keenandr at msu.edu
> > > Date: Tuesday, February 14, 2012, 6:36 AM
> > >
> > > Hi all,
> > >
> > >       I finally made some progress but am not all the way there yet.
> > > I changed the hostname of the server to admin, which is the hostname
> > > assigned to the interface that the compute nodes are physically
> > > connected to.  Now my pbsnodes command shows the nodes as free!!
> > >
> > > [root at wings torque]# pbsnodes -a
> > > n001.default.domain
> > >      state = free
> > >      np = 1
> > >      ntype = cluster
> > >      status =
> rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=?
> 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May
> 10 15:42:40 EDT 2011 x86_64,opsys=linux
> > >      gpus = 0
> > >
> > > n002.default.domain
> > >      state = free
> > >      np = 1
> > >      ntype = cluster
> > >      status =
> rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=?
> 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May
> 10 15:42:40 EDT 2011 x86_64,opsys=linux
> > >      gpus = 0
> > >
> > > ....For all 20 nodes.
> > >
> > > And now when I submit a job, I get a job id back; however, the job
> > > stays in the queued state.
> > >
> > > -bash-4.1$ ./example_submit_script_1
> > > Fri Feb 10 15:46:35 CST 2012
> > > Fri Feb 10 15:46:45 CST 2012
> > > -bash-4.1$ ./example_submit_script_1 | qsub
> > > 6.admin.default.domain
> > > -bash-4.1$ qstat
> > > Job id                    Name             User            Time Use S Queue
> > > ------------------------- ---------------- --------------- -------- - -----
> > > 4.wings                    STDIN            salls                  0 Q batch
> > > 5.wings                    STDIN            salls                  0 Q batch
> > > 6.admin                    STDIN            salls                  0 Q batch
> > >
> > > I deleted the two jobs that were created when wings was the server, in
> > > case they were getting in the way.
> > >
> > > [root at wings torque]# qstat
> > > Job id                    Name             User            Time Use S Queue
> > > ------------------------- ---------------- --------------- -------- - -----
> > > 6.admin                    STDIN            salls                  0 Q batch
> > > [root at wings torque]# qstat -a
> > >
> > > admin.default.domain:
> > >
> > >                                                                          Req'd  Req'd   Elap
> > > Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> > > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> > > 6.admin.default.     salls    batch    STDIN               --    --   --    --    --  Q   --
> > > [root at wings torque]#
> > >
> > >
> > > I don't see anything that seems significant in the logs:
> > >
> > > Lots of entries like this in the server log:
> > > 02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server
> Version = 2.5.9, loglevel = 0
> > > 02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server
> Version = 2.5.9, loglevel = 0
> > > 02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server
> Version = 2.5.9, loglevel = 0
> > >
> > > This is the entirety of the sched_log:
> > >
> > > 02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
> > > 02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file
> /var/spool/torque/sched_priv/accounting/20120210 opened
> > > 02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid
> 12576
> > > 02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
> > > 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
> > > 02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address
> already in use (98) in main, bind
> > > 02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
> > > 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed
> > >
> > > mom logs on the compute nodes have the same multiple entries:
> > >
> > > 02/14/2012 08:03:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > > 02/14/2012 08:08:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > > 02/14/2012 08:13:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > > 02/14/2012 08:18:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > > 02/14/2012 08:23:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 2.5.9, loglevel = 0
> > >
> > > ps looks like this:
> > >
> > > -bash-4.1$ ps -ef | grep pbs
> > > root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
> > > salls    12727 26862  0 08:19 pts/0    00:00:00 grep pbs
> > > root     25810     1  0 Feb10 ?        00:00:25 pbs_server -H
> admin.default.domain
> > >
> > > The server and queue settings are as follows:
> > >
> > > Qmgr: list server
> > > Server admin.default.domain
> > >       server_state = Active
> > >       scheduling = True
> > >       total_jobs = 1
> > >       state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0
> Exiting:0
> > >       acl_hosts = admin.default.domain,wings.glerl.noaa.gov
> > >       default_queue = batch
> > >       log_events = 511
> > >       mail_from = adm
> > >       scheduler_iteration = 600
> > >       node_check_rate = 150
> > >       tcp_timeout = 6
> > >       mom_job_sync = True
> > >       pbs_version = 2.5.9
> > >       keep_completed = 300
> > >       next_job_number = 7
> > >       net_counter = 1 0 0
> > >
> > > Qmgr: list queue batch
> > > Queue batch
> > >       queue_type = Execution
> > >       Priority = 100
> > >       total_jobs = 1
> > >       state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0
> Exiting:0
> > >       max_running = 300
> > >       mtime = Thu Feb  9 18:22:33 2012
> > >       enabled = True
> > >       started = True
> > >
> > > Do I need to create a routing queue?  It seems like I am missing a
> > > basic element here.
> > >
> > > Thanks in advance,
> > >
> > > Christina
> > >
> > >
> > >
> > > --
> > > Christina A. Salls
> > > GLERL Computer Group
> > > help.glerl at noaa.gov
> > > Help Desk x2127
> > > Christina.Salls at noaa.gov
> > > Voice Mail 734-741-2446
> > >
> > >
> > >
> >
> >
> > --
> > Christina A. Salls
> > GLERL Computer Group
> > help.glerl at noaa.gov
> > Help Desk x2127
> > Christina.Salls at noaa.gov
> > Voice Mail 734-741-2446
> >
> >
>
>



-- 
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446