[torqueusers] Basic torque config

Grigory Shamov gas5x at yahoo.com
Tue Feb 14 08:36:59 MST 2012


Do you have a scheduler installed, like Maui or Moab?
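pbs_server on its own only queues jobs; a scheduler (Maui, Moab, or the bundled pbs_sched) has to be running and healthy for anything to start. A quick check, as a rough sketch (daemon names assumed, adjust to your install):

    # is any scheduler daemon alive?
    ps -ef | egrep 'pbs_sched|maui|moab' | grep -v grep

    # if not, the simple FIFO scheduler that ships with Torque can be started with:
    pbs_sched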

--- On Tue, 2/14/12, Christina Salls <christina.salls at noaa.gov> wrote:

From: Christina Salls <christina.salls at noaa.gov>
Subject: [torqueusers] Basic torque config
To: "Torque Users Mailing List" <torqueusers at supercluster.org>, "Brian Beagan" <beagan at sgi.com>, "John Cardenas" <cardenas at sgi.com>, "Jeff Hanson" <jhanson at sgi.com>, "Michael Saxon" <saxonm at sgi.com>, "help >> GLERL IT Help" <oar.glerl.it-help at noaa.gov>, keenandr at msu.edu
Date: Tuesday, February 14, 2012, 6:36 AM

Hi all,
      I finally made some progress but am not all the way there yet.  I changed the hostname of the server to admin, which is the hostname assigned to the interface that the compute nodes are physically connected to.  Now my pbsnodes command shows the nodes as free!!

[root at wings torque]# pbsnodes -a
n001.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0
n002.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

...and so on for all 20 nodes.
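For reference, which server name the nodes and clients talk to is controlled by a few plain-text files; a minimal sketch, assuming the default /var/spool/torque layout (paths vary by build):

    # on the server: one line per compute node
    # /var/spool/torque/server_priv/nodes
    n001 np=24
    n002 np=24

    # on each compute node: point pbs_mom at the cluster-facing hostname
    # /var/spool/torque/mom_priv/config
    $pbsserver admin.default.domain

    # on submit hosts: the name qsub/qstat contact
    # /var/spool/torque/server_name
    admin.default.domain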
And now when I submit a job, I get a job id back; however, the job stays in the queued (Q) state.
-bash-4.1$ ./example_submit_script_1
Fri Feb 10 15:46:35 CST 2012
Fri Feb 10 15:46:45 CST 2012
-bash-4.1$ ./example_submit_script_1 | qsub
6.admin.default.domain
-bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4.wings                    STDIN            salls                  0 Q batch
5.wings                    STDIN            salls                  0 Q batch
6.admin                    STDIN            salls                  0 Q batch
I deleted the two jobs that were created when wings was the server, in case they were getting in the way.

[root at wings torque]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.admin                    STDIN            salls                  0 Q batch
[root at wings torque]# qstat -a

admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
6.admin.default.     salls    batch    STDIN               --    --   --    --    --  Q   --
[root at wings torque]#
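For a job stuck in Q, the stock Torque tools usually narrow it down; a sketch using job 6 from the output above:

    qstat -f 6      # full job attributes, including any scheduler comment
    tracejob 6      # collate server/mom log entries for the job (run as root on the server)
    qrun 6          # as manager/operator: start the job directly, bypassing the scheduler

If qrun runs the job, pbs_server and the moms are working and the scheduler is the missing piece.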
I don't see anything that seems significant in the logs. The server log is full of entries like this:
02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0

This is the entirety of the sched_log:
02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120210 opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid 12576
02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address already in use (98) in main, bind
02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed
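For what it's worth, that LOG_ERROR usually means a second pbs_sched was started while the first (pid 12576, from the startup line above) still held the scheduler port, so the new copy aborted and the original kept running. A clean restart might look like this sketch (15004 is the usual pbs_sched port, assumed here):

    netstat -lnp | grep 15004    # which pid owns the scheduler port
    kill 12576                   # stop the old pbs_sched
    pbs_sched                    # start a fresh one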

The mom logs on the compute nodes show the same kind of repeated entries:
02/14/2012 08:03:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:08:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:13:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:18:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:23:00;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0

ps looks like this:
-bash-4.1$ ps -ef | grep pbs
root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
salls    12727 26862  0 08:19 pts/0    00:00:00 grep pbs
root     25810     1  0 Feb10 ?        00:00:25 pbs_server -H admin.default.domain
The server and queue settings are as follows:
Qmgr: list server
Server admin.default.domain
    server_state = Active
    scheduling = True
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    acl_hosts = admin.default.domain,wings.glerl.noaa.gov
    default_queue = batch
    log_events = 511
    mail_from = adm
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    mom_job_sync = True
    pbs_version = 2.5.9
    keep_completed = 300
    next_job_number = 7
    net_counter = 1 0 0
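For completeness, the same settings can be inspected and changed non-interactively; for example, to make sure pbs_server is actually asking a scheduler for cycles:

    qmgr -c 'print server'                    # dump the full config as replayable qmgr commands
    qmgr -c 'set server scheduling = true'    # already True above, shown for reference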
Qmgr: list queue batch
Queue batch
    queue_type = Execution
    Priority = 100
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    max_running = 300
    mtime = Thu Feb  9 18:22:33 2012
    enabled = True
    started = True
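A queue this bare often leaves jobs without default resources; a sketch of common additions (the values are placeholders, not recommendations):

    qmgr -c 'set queue batch resources_default.nodes = 1'
    qmgr -c 'set queue batch resources_default.walltime = 01:00:00'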

Do I need to create a routing queue?  It seems like I am missing a basic element here.  
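For reference, a routing queue is only needed to funnel submissions into several execution queues, so a single batch queue should not require one. If one were wanted, a sketch (the queue name here is made up):

    qmgr -c 'create queue submit'
    qmgr -c 'set queue submit queue_type = Route'
    qmgr -c 'set queue submit route_destinations = batch'
    qmgr -c 'set queue submit enabled = true'
    qmgr -c 'set queue submit started = true'
    qmgr -c 'set server default_queue = submit'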
Thanks in advance,
Christina



-- 
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446 






