[torqueusers] PBS Scheduling Weirdness

Edsall, William (WJ) WJEdsall at dow.com
Wed May 20 10:18:01 MDT 2009


I usually test with a STDIN command such as this. 
 
> echo "sleep 10" | qsub -l nodes=1:node4:ppn=4
 
My job runs, but as you can see I only get one CPU, on the wrong node
(node2 instead of node4). The same thing happens when I request multiple
nodes. This was working before, and it still works on our other clusters,
but as of Monday this week it fails.
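
For reference, the allocation can also be checked from inside the job: TORQUE
writes the list of assigned processors (one line per CPU slot) to the file
named by $PBS_NODEFILE, so a variation of the same STDIN test should print
node4 four times if ppn=4 were being honoured:

> echo 'cat $PBS_NODEFILE; sleep 10' | qsub -l nodes=1:node4:ppn=4

Here it would presumably list just a single node2 slot, matching the
exec_host in the qstat output below.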
 
> qstat -f 1059
Job Id: 1059
    Job_Name = STDIN
    Job_Owner =  <deleted>
    job_state = R
    queue = batch
    server = <deleted>com
    Checkpoint = u
    ctime = Wed May 20 11:46:18 2009
    Error_Path = <deleted>
    exec_host = node2/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed May 20 11:46:26 2009
    Output_Path = <deleted>/STDIN.o1059
    Priority = 0
    qtime = Wed May 20 11:46:18 2009
    Rerunable = True
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    session_id = 12814
    substate = 42
    Variable_List = PBS_O_HOME=/home/<deleted>,PBS_O_LANG=POSIX,
        PBS_O_LOGNAME=<deleted>,
        PBS_O_PATH=/usr/local/torque/sbin:/usr/local/torque/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin,
        PBS_O_MAIL=/var/mail/<deleted>,PBS_O_SHELL=/bin/tcsh,
        PBS_SERVER=txmerig.nam.dow.com,PBS_O_HOST=txmerig.nam.dow.com,
        PBS_O_WORKDIR=/home/<deleted>,PBS_O_QUEUE=batch
    euser = <deleted>
    egroup = users
    hashname = 1059.<deleted>.com
    queue_rank = 996
    queue_type = E
    comment = Job started on Wed May 20 at 11:46
    etime = Wed May 20 11:46:18 2009
    submit_args = -l nodes=1:node4:ppn=4
    start_time = Wed May 20 11:46:26 2009
    start_count = 1
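
Note also that Resource_List.nodes has been recorded as a plain 1 even though
submit_args still carries the full nodes=1:node4:ppn=4 request. To see how the
server and scheduler handled the request step by step, something like this
(run on the server host, assuming the tracejob utility is installed) should
pull the relevant log entries for the job:

> tracejob 1059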

 
On other, known-working clusters, requesting resources in the same fashion
works fine, as seen here:

    exec_host = node14/3+node14/2+node14/1+node14/0+node13/3+node13/2
        +node13/1+node13/0
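
It is probably also worth confirming that node4 is reporting the expected
processor count and properties to pbs_server (a quick check from the server
host):

> pbsnodes node4

The output should show state = free and np = 4 (or higher); if np is lower,
or the node name/property does not match "node4", a nodes=1:node4:ppn=4
request could not be satisfied as written.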



________________________________

	From: Jerry Smith [mailto:jdsmit at sandia.gov] 
	Sent: Wednesday, May 20, 2009 12:02 PM
	To: Edsall, William (WJ)
	Cc: torqueusers at supercluster.org
	Subject: Re: [torqueusers] PBS Scheduling Weirdness
	
	
	Sorry, I forgot to ask this as well: can we get a copy of the script you are submitting and the qsub command you are using?
	
	Jerry
	
	Edsall, William (WJ) wrote: 

		Hello,
		 Here is the output. I'm using the Torque scheduler; Maui is on the system but not running.
		 
		# qmgr -c "p s"
		#
		# Create queues and set their attributes.
		#
		#
		# Create and define queue batch
		#
		create queue batch
		set queue batch queue_type = Execution
		set queue batch resources_default.nodes = 1
		set queue batch resources_default.walltime = 01:00:00
		set queue batch enabled = True
		set queue batch started = True
		#
		# Set server attributes.
		#
		set server scheduling = True
		set server acl_hosts = txmerig
		# (list of managers and operators stripped out)
		set server default_queue = batch
		set server log_events = 511
		set server mail_from = adm
		set server scheduler_iteration = 600
		set server node_check_rate = 150
		set server tcp_timeout = 6
		set server next_job_number = 1054
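
		One more quick check, since scheduling is True here: confirming that pbs_sched really is the daemon doing the placement and that Maui is stopped. Something like the following, run on the server, should show which scheduler processes are alive:

		# ps -ef | egrep 'pbs_sched|maui'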
		


________________________________

			From: Jerry Smith [mailto:jdsmit at sandia.gov] 
			Sent: Tuesday, May 19, 2009 4:05 PM
			To: Edsall, William (WJ)
			Cc: torqueusers at supercluster.org
			Subject: Re: [torqueusers] PBS Scheduling Weirdness
			
			
			Can you give us the output from:
			
			qmgr -c "p s" 
			
			and are you using any external scheduler, Maui
or Moab or the like?
			
			Thanks,
			
			--Jerry
			
			Edsall, William (WJ) wrote: 

				Hello list, 
				 Having a strange problem with Torque version 2.4.0b1.

				It seems that no matter how many resources I request, I only get one CPU on the first available node.

				Please help me brainstorm the possible
causes. 
				
				_______________________________________
				William J. Edsall
				

