[torqueusers] Basic torque config

Christina Salls christina.salls@noaa.gov
Tue Feb 14 07:36:35 MST 2012


Hi all,

      I finally made some progress but am not all the way there yet.  I
changed the hostname of the server to admin, which is the hostname assigned
to the interface that the compute nodes are physically connected to.  Now
my pbsnodes command shows the nodes as free!!

[root@wings torque]# pbsnodes -a
n001.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

n002.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

....For all 20 nodes.
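
For context, the piece I changed is how the nodes find the server. On my setup it now looks roughly like this (paths are the stock /var/spool/torque defaults; contents quoted from memory, so treat this as a sketch):

# on the server: the name pbs_server identifies itself by
[root@wings torque]# cat /var/spool/torque/server_name
admin

# on each compute node: mom_priv/config points the mom at that name
[root@n001 ~]# cat /var/spool/torque/mom_priv/config
$pbsserver admin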

And now when I submit a job, I get a job id back; however, the job stays
in the queued state.

-bash-4.1$ ./example_submit_script_1
Fri Feb 10 15:46:35 CST 2012
Fri Feb 10 15:46:45 CST 2012
-bash-4.1$ ./example_submit_script_1 | qsub
6.admin.default.domain
-bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4.wings                    STDIN            salls                  0 Q batch
5.wings                    STDIN            salls                  0 Q batch
6.admin                    STDIN            salls                  0 Q batch
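
For reference, the same job can be submitted inline without the script (plain qsub usage; the sleep just keeps the job around long enough to show up in qstat):

-bash-4.1$ echo "date; sleep 10; date" | qsub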

I deleted the two jobs that were created when wings was the server, in case
they were getting in the way:

[root@wings torque]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.admin                    STDIN            salls                  0 Q batch
[root@wings torque]# qstat -a

admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
6.admin.default.     salls    batch    STDIN               --    --   --    --    --  Q   --
[root@wings torque]#
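
(For the record, removing them was just plain qdel, along the lines of:

[root@wings torque]# qdel 4 5
)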


I don't see anything that seems significant in the logs:

Lots of entries like this in the server log:
02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0

This is the entirety of the sched_log:

02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120210 opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid 12576
02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address already in use (98) in main, bind
02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed
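
The one entry that does stand out is that bind error: error 98 (EADDRINUSE) should mean a second pbs_sched tried to start on Feb 10 while the first one (pid 12576, still visible in ps below) was holding the scheduler port. If that surviving scheduler is wedged, I assume the recovery is just to bounce it, something like this (15004 being the default pbs_sched port, as far as I know):

[root@wings torque]# netstat -tlnp | grep 15004    # see what holds the scheduler port
[root@wings torque]# kill 12576                    # stop the old scheduler
[root@wings torque]# pbs_sched                     # start a fresh one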

The mom logs on the compute nodes just have the same entry repeated:

02/14/2012 08:03:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:08:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:13:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:18:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:23:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
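
To double-check that the moms really are reachable from the server side, I believe I can also query one directly with momctl (standard momctl flags; n001 is just one of my nodes):

[root@wings torque]# momctl -d 3 -h n001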

ps looks like this:

-bash-4.1$ ps -ef | grep pbs
root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
salls    12727 26862  0 08:19 pts/0    00:00:00 grep pbs
root     25810     1  0 Feb10 ?        00:00:25 pbs_server -H admin.default.domain

The server and queue settings are as follows:

Qmgr: list server
Server admin.default.domain
    server_state = Active
    scheduling = True
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    acl_hosts = admin.default.domain,wings.glerl.noaa.gov
    default_queue = batch
    log_events = 511
    mail_from = adm
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    mom_job_sync = True
    pbs_version = 2.5.9
    keep_completed = 300
    next_job_number = 7
    net_counter = 1 0 0

Qmgr: list queue batch
Queue batch
    queue_type = Execution
    Priority = 100
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    max_running = 300
    mtime = Thu Feb  9 18:22:33 2012
    enabled = True
    started = True
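
One thing I notice in the queue config is that there are no default resources set. If jobs submitted without any -l request are what's stalling, I gather the usual fix is along these lines (standard qmgr queue attributes; I have not applied these yet):

[root@wings torque]# qmgr -c "set queue batch resources_default.nodes = 1"
[root@wings torque]# qmgr -c "set queue batch resources_default.walltime = 01:00:00"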

Do I need to create a routing queue?  It seems like I am missing a basic
element here.
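
In the meantime, one test I can run is forcing the queued job onto a node by hand, which bypasses pbs_sched entirely and would at least tell me whether the server-to-mom path works (standard qrun usage, run as root/manager):

[root@wings torque]# qrun -H n001 6.admin.default.domain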

Thanks in advance,

Christina



-- 
Christina A. Salls
GLERL Computer Group
help.glerl@noaa.gov
Help Desk x2127
Christina.Salls@noaa.gov
Voice Mail 734-741-2446