[torqueusers] Basic torque config
Grigory Shamov
gas5x at yahoo.com
Tue Feb 14 08:36:59 MST 2012
Do you have a scheduler installed, such as Maui or Moab?
--- On Tue, 2/14/12, Christina Salls <christina.salls at noaa.gov> wrote:
From: Christina Salls <christina.salls at noaa.gov>
Subject: [torqueusers] Basic torque config
To: "Torque Users Mailing List" <torqueusers at supercluster.org>, "Brian Beagan" <beagan at sgi.com>, "John Cardenas" <cardenas at sgi.com>, "Jeff Hanson" <jhanson at sgi.com>, "Michael Saxon" <saxonm at sgi.com>, "help >> GLERL IT Help" <oar.glerl.it-help at noaa.gov>, keenandr at msu.edu
Date: Tuesday, February 14, 2012, 6:36 AM
Hi all,
I finally made some progress but am not all the way there yet. I changed the hostname of the server to admin, which is the hostname assigned to the interface that the compute nodes are physically connected to. Now my pbsnodes command shows the nodes as free!!
[root at wings torque]# pbsnodes -a
n001.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

n002.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

....For all 20 nodes.
And now when I submit a job, I get a job id back; however, the job stays in the queued state.
-bash-4.1$ ./example_submit_script_1
Fri Feb 10 15:46:35 CST 2012
Fri Feb 10 15:46:45 CST 2012
-bash-4.1$ ./example_submit_script_1 | qsub
6.admin.default.domain
-bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4.wings                   STDIN            salls                  0 Q batch
5.wings                   STDIN            salls                  0 Q batch
6.admin                   STDIN            salls                  0 Q batch
I deleted the two jobs that were created when wings was the server, in case they were getting in the way:
[root at wings torque]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.admin                   STDIN            salls                  0 Q batch
[root at wings torque]# qstat -a
admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
6.admin.default.     salls    batch    STDIN                --    --  --     --    -- Q    --
[root at wings torque]#
I don't see anything that seems significant in the logs:
Lots of entries like this in the server log:
02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
This is the entirety of the sched_log:
02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120210 opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid 12576
02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address already in use (98) in main, bind
02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed
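The only entry in that log that is not routine is the bind failure at 15:45:04. A minimal sketch for pulling such lines out of a Torque-style log follows. It assumes the semicolon-separated layout quoted above, with the message text in the sixth field; the function name sched_errors is made up for illustration, and the sample lines are copied from the sched_log in this message.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): read a Torque-style log on stdin
# and print the timestamp and message of every entry whose message starts
# with LOG_ERROR. Assumes fields are separated by ';' and the message is
# field 6, as in the sched_log quoted in this message.
sched_errors() {
    awk -F';' 'index($6, "LOG_ERROR") == 1 { print $1 " " $6 }'
}

# Sample lines copied from the sched_log above.
sample_log='02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address already in use (98) in main, bind
02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination'

printf '%s\n' "$sample_log" | sched_errors
```

An "Address already in use" error on bind usually means another process already holds the port. Here that would be consistent with the pbs_sched started at 07:06:52 (pid 12576) still running when a second copy was launched, though the log alone does not prove it.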
mom logs on the compute nodes have the same multiple entries:
02/14/2012 08:03:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:08:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:13:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:18:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:23:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
ps looks like this:
-bash-4.1$ ps -ef | grep pbs
root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
salls    12727 26862  0 08:19 pts/0    00:00:00 grep pbs
root     25810     1  0 Feb10 ?        00:00:25 pbs_server -H admin.default.domain
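On the question of whether a scheduler is present, the listing above can be checked mechanically. This is only a sketch: it assumes `ps -ef` column order (command in the eighth column), and the daemon names pbs_sched, maui and moab are assumptions about how the schedulers appear in the process table on a given install. The function name find_schedulers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): read `ps -ef` style text on stdin
# and report any process whose command name looks like a scheduler daemon.
# Assumed names: pbs_sched, maui, moab. Adjust for the local install.
find_schedulers() {
    awk '{
        n = split($8, parts, "/")   # drop any leading path from the command
        name = parts[n]
        if (name == "pbs_sched" || name == "maui" || name == "moab")
            print name, "pid", $2
    }'
}

# Sample lines copied from the ps output above.
sample_ps='root 12576 1 0 Feb10 ? 00:00:00 pbs_sched
salls 12727 26862 0 08:19 pts/0 00:00:00 grep pbs
root 25810 1 0 Feb10 ? 00:00:25 pbs_server -H admin.default.domain'

printf '%s\n' "$sample_ps" | find_schedulers
```

On a live system the same function could be fed from `ps -ef` directly. For the listing in this message it reports only pbs_sched, started on Feb 10, before the server was moved to the admin hostname.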
The server and queue settings are as follows:
Qmgr: list server
Server admin.default.domain
    server_state = Active
    scheduling = True
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    acl_hosts = admin.default.domain,wings.glerl.noaa.gov
    default_queue = batch
    log_events = 511
    mail_from = adm
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    mom_job_sync = True
    pbs_version = 2.5.9
    keep_completed = 300
    next_job_number = 7
    net_counter = 1 0 0
Qmgr: list queue batch
Queue batch
    queue_type = Execution
    Priority = 100
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    max_running = 300
    mtime = Thu Feb 9 18:22:33 2012
    enabled = True
    started = True
Do I need to create a routing queue? It seems like I am missing a basic element here.
Thanks in advance,
Christina
--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446