[torqueusers] PBS Scheduling Weirdness

Edsall, William (WJ) WJEdsall at dow.com
Wed May 20 11:57:58 MDT 2009


Thank you for the well-thought-out responses; we've solved the issue.
 
Unfortunately, pbs_server was locked up and I did not know it.
Restart attempts appeared to work, but we ended up killing it with kill -9.
After restarting it, the jobs are running as expected.
 
Thanks again.
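
For anyone who hits the same symptom, here is a minimal sketch of how a hung pbs_server might be detected before resorting to kill -9. It assumes GNU timeout is available and that the TORQUE client commands are on the PATH. The helper names probe and force_restart and the 10-second limit are made up for illustration and are not part of TORQUE.

```shell
# probe LIMIT CMD...: run CMD with a time limit and report whether it answered.
probe() {
    limit=$1
    shift
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "unresponsive"
    fi
}

# force_restart: last-resort recovery, mirroring the manual fix described above.
# Defined here but never called automatically; run it as root only after the
# probe fails and a normal restart has not helped.
force_restart() {
    pkill -9 -x pbs_server   # SIGKILL the daemon, as in "killing it -9"
    sleep 2
    pbs_server               # start the daemon again
}

# qstat -B asks the server for its status; a hung server does not answer in time.
probe 10 qstat -B
```

A result of "unresponsive" only means that qstat -B did not succeed within the limit. That can also happen when the client tools are missing or the server is simply stopped, so it is worth checking the server log before killing anything.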


________________________________

	From: Jerry Smith [mailto:jdsmit at sandia.gov] 
	Sent: Wednesday, May 20, 2009 1:29 PM
	To: Edsall, William (WJ)
	Cc: torqueusers at supercluster.org
	Subject: Re: [torqueusers] PBS Scheduling Weirdness
	
	
	Try: 
	echo "sleep 10" | qsub -l nodes=node4:ppn=4
	or 
	echo "sleep 10" | qsub -l nodes=1:ppn=4
	
	Does this change anything?
	
	--Jerry
	
	Edsall, William (WJ) wrote: 

		I usually test with a STDIN command such as this. 
		 
		> echo "sleep 10" | qsub -l nodes=1:node4:ppn=4
		 
		My job runs, but as you can see I only get one CPU, and on the wrong node. The result is the same when I request multiple nodes. This was working, and still works on our other clusters, but as of Monday this week it fails.
		 
		> qstat -f 1059
		Job Id: 1059
		    Job_Name = STDIN
		    Job_Owner =  <deleted>
		    job_state = R
		    queue = batch
		    server = <deleted>com
		    Checkpoint = u
		    ctime = Wed May 20 11:46:18 2009
		    Error_Path = <deleted>
		    exec_host = node2/0
		    Hold_Types = n
		    Join_Path = n
		    Keep_Files = n
		    Mail_Points = a
		    mtime = Wed May 20 11:46:26 2009
		    Output_Path = <deleted>/STDIN.o1059
		    Priority = 0
		    qtime = Wed May 20 11:46:18 2009
		    Rerunable = True
		    Resource_List.neednodes = 1
		    Resource_List.nodect = 1
		    Resource_List.nodes = 1
		    Resource_List.walltime = 01:00:00
		    session_id = 12814
		    substate = 42
		    Variable_List = PBS_O_HOME=/home/<deleted>,PBS_O_LANG=POSIX,
		        PBS_O_LOGNAME=<deleted>,
		        PBS_O_PATH=/usr/local/torque/sbin:/usr/local/torque/bin:/usr/bin:/bin
		        :/usr/sbin:/sbin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games
		        :/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin,
		        PBS_O_MAIL=/var/mail/<deleted>,PBS_O_SHELL=/bin/tcsh,
		        PBS_SERVER=txmerig.nam.dow.com,PBS_O_HOST=txmerig.nam.dow.com,
		        PBS_O_WORKDIR=/home/<deleted>,PBS_O_QUEUE=batch
		    euser = <deleted>
		    egroup = users
		    hashname = 1059.<deleted>.com
		    queue_rank = 996
		    queue_type = E
		    comment = Job started on Wed May 20 at 11:46
		    etime = Wed May 20 11:46:18 2009
		    submit_args = -l nodes=1:node4:ppn=4
		    start_time = Wed May 20 11:46:26 2009
		    start_count = 1
		
		 
		On other known-working clusters, requesting resources in the same fashion works fine, as seen here:
		    exec_host = node14/3+node14/2+node14/1+node14/0+node13/3+node13/2+node13/1
		        +node13/0
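
A quick way to compare the two results is to count the slots per node in the exec_host string. The sketch below is one possible helper, not a TORQUE tool. The function name count_slots is hypothetical, and it assumes the node/slot+node/slot format shown in the qstat -f output above.

```shell
# count_slots EXEC_HOST: print "node=slots" for each node in an exec_host value.
count_slots() {
    printf '%s\n' "$1" |
        tr '+' '\n' |        # one node/slot entry per line
        cut -d/ -f1 |        # keep only the node name
        sort | uniq -c |     # count entries per node
        awk '{print $2 "=" $1}'
}

# Working cluster: two nodes with four slots each.
count_slots "node14/3+node14/2+node14/1+node14/0+node13/3+node13/2+node13/1+node13/0"
# prints:
# node13=4
# node14=4

# Failing job 1059: a single slot on node2.
count_slots "node2/0"
# prints:
# node2=1
```

Feeding it the exec_host line of a running job makes it easy to see at a glance whether the allocation matches the nodes and ppn values that were requested.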
		
		

________________________________

			From: Jerry Smith [mailto:jdsmit at sandia.gov] 
			Sent: Wednesday, May 20, 2009 12:02 PM
			To: Edsall, William (WJ)
			Cc: torqueusers at supercluster.org
			Subject: Re: [torqueusers] PBS Scheduling Weirdness
			
			
			Sorry, I forgot to ask this as well: can we get a copy of the script you are submitting and the qsub command you are using?
			
			Jerry
			
			Edsall, William (WJ) wrote: 

				Hello,
				Here is the output. I'm using the TORQUE scheduler; Maui is installed on the system but not running.
				 
				# qmgr -c "p s"
				#
				# Create queues and set their attributes.
				#
				#
				# Create and define queue batch
				#
				create queue batch
				set queue batch queue_type = Execution
				set queue batch resources_default.nodes = 1
				set queue batch resources_default.walltime = 01:00:00
				set queue batch enabled = True
				set queue batch started = True
				#
				# Set server attributes.
				#
				set server scheduling = True
				set server acl_hosts = txmerig
				//stripped out the list of managers and operators
				set server default_queue = batch
				set server log_events = 511
				set server mail_from = adm
				set server scheduler_iteration = 600
				set server node_check_rate = 150
				set server tcp_timeout = 6
				set server next_job_number = 1054
				


________________________________

				From: Jerry Smith [mailto:jdsmit at sandia.gov] 
				Sent: Tuesday, May 19, 2009 4:05 PM
				To: Edsall, William (WJ)
				Cc: torqueusers at supercluster.org
				Subject: Re: [torqueusers] PBS Scheduling Weirdness
				
				
				Can you give us the output from:
				
				qmgr -c "p s" 
				
				Also, are you using an external scheduler such as Maui or Moab?
				
				Thanks,
				
				--Jerry
				
				Edsall, William (WJ) wrote: 

				Hello list,
				I'm having a strange problem with TORQUE version 2.4.0b1.

				It seems that no matter how many resources I request, I only get one CPU on the first available node.

				Please help me brainstorm the possible causes.
				
				_______________________________________
				William J. Edsall
				

