[torqueusers] Basic torque config
Christina Salls
christina.salls at noaa.gov
Tue Feb 14 07:36:35 MST 2012
Hi all,
I finally made some progress but am not all the way there yet. I
changed the hostname of the server to admin, which is the hostname assigned
to the interface that the compute nodes are physically connected to. Now
my pbsnodes command shows the nodes as free!!
[root@wings torque]# pbsnodes -a
n001.default.domain
state = free
np = 1
ntype = cluster
status = rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
gpus = 0
n002.default.domain
state = free
np = 1
ntype = cluster
status = rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
gpus = 0
... and so on for all 20 nodes.
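One thing visible in this output is that each node is configured with np = 1 while its status line reports ncpus=24; that may or may not be intended. A minimal sketch for listing the two values side by side, assuming pbsnodes -a prints each hostname alone on its own line as above (np_vs_ncpus is a made-up helper name, not a Torque command):

```shell
# Hypothetical helper: read `pbsnodes -a` style text on stdin and print,
# per node, the configured np next to the ncpus value the node reports.
np_vs_ncpus() {
    awk '
        # A line with a single field and no "=" is taken to be a hostname.
        NF == 1 && index($0, "=") == 0 { node = $1 }
        # The configured processor count, e.g. "np = 1".
        $1 == "np" && $2 == "=" { np = $3 }
        # The reported CPU count sits inside the comma-separated status string.
        match($0, /ncpus=[0-9]+/) {
            ncpus = substr($0, RSTART + 6, RLENGTH - 6)
            print node, "np=" np, "ncpus=" ncpus
        }
    '
}
```

Usage would be along the lines of: pbsnodes -a | np_vs_ncpus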
And now when I submit a job, I get a job id back; however, the job stays
in the queued state.
-bash-4.1$ ./example_submit_script_1
Fri Feb 10 15:46:35 CST 2012
Fri Feb 10 15:46:45 CST 2012
-bash-4.1$ ./example_submit_script_1 | qsub
6.admin.default.domain
-bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4.wings                   STDIN            salls                  0 Q batch
5.wings                   STDIN            salls                  0 Q batch
6.admin                   STDIN            salls                  0 Q batch
I deleted the two jobs that were created when wings was the server, in case
they were getting in the way.
[root@wings torque]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.admin                   STDIN            salls                  0 Q batch
[root@wings torque]# qstat -a
admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
6.admin.default.     salls    batch    STDIN                --    --  --     --    -- Q    --
[root@wings torque]#
I don't see anything that seems significant in the logs:
Lots of entries like this in the server log:
02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
This is the entirety of the sched_log:
02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120210 opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid 12576
02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address already in use (98) in main, bind
02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed
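The bind failure at 15:45:04 is the only error-level entry in that log; it is consistent with a second pbs_sched being started while pid 12576 from 07:06:52 was still running. A minimal sketch for picking such entries out of a longer log, assuming the semicolon-separated layout shown above and no semicolons inside the message itself (log_errors is a hypothetical helper name, not a Torque tool):

```shell
# Hypothetical helper: print the timestamp and message of every LOG_ERROR
# entry in a Torque log. Reads the files given as arguments, or stdin.
# Assumed field layout: date;event code;daemon;class;name;message
log_errors() {
    awk -F';' '$6 ~ /LOG_ERROR/ { print $1, $6 }' "$@"
}
```

Usage would be something like log_errors /var/spool/torque/sched_logs/20120210, where the sched_logs path is an assumption based on the spool directory seen in the log above.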
The mom logs on the compute nodes show the same repeated entries:
02/14/2012 08:03:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:08:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:13:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:18:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:23:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
ps looks like this:
-bash-4.1$ ps -ef | grep pbs
root 12576 1 0 Feb10 ? 00:00:00 pbs_sched
salls 12727 26862 0 08:19 pts/0 00:00:00 grep pbs
root 25810 1 0 Feb10 ? 00:00:25 pbs_server -H admin.default.domain
The server and queue settings are as follows:
Qmgr: list server
Server admin.default.domain
server_state = Active
scheduling = True
total_jobs = 1
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
acl_hosts = admin.default.domain,wings.glerl.noaa.gov
default_queue = batch
log_events = 511
mail_from = adm
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.5.9
keep_completed = 300
next_job_number = 7
net_counter = 1 0 0
Qmgr: list queue batch
Queue batch
queue_type = Execution
Priority = 100
total_jobs = 1
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
max_running = 300
mtime = Thu Feb 9 18:22:33 2012
enabled = True
started = True
Do I need to create a routing queue? It seems like I am missing a basic
element here.
Thanks in advance,
Christina
--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446