[torqueusers] PBS Scheduling Weirdness
Edsall, William (WJ)
WJEdsall at dow.com
Wed May 20 11:57:58 MDT 2009
Thank you for the thoughtful responses; we've solved the issue.
Unfortunately, pbs_server was locked up and I did not know it.
Restart attempts appeared to work, but we ended up killing it with kill -9.
After restarting it, the jobs are working well.
Thanks again.
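A hung pbs_server like the one described here can look alive to a restart script while no longer answering queries. The following is a minimal sketch of one way to probe for that, not the method used in this thread. It assumes a shell with the coreutils timeout command; responds_within is a hypothetical helper name, and qstat -B appears only as an example query in a comment.

```shell
# Hypothetical helper (not part of TORQUE): run a command with a time limit
# and report whether it completed successfully within that limit.
responds_within() {
    limit=$1
    shift
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo responsive
    else
        # Either the command timed out or it exited with an error.
        echo unresponsive
    fi
}

# Assumed usage on a host with the TORQUE client tools installed:
#   responds_within 10 qstat -B
```

A result of "unresponsive" only says the query failed or timed out; it does not say why, so the server logs would still need checking.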
________________________________
From: Jerry Smith [mailto:jdsmit at sandia.gov]
Sent: Wednesday, May 20, 2009 1:29 PM
To: Edsall, William (WJ)
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] PBS Scheduling Weirdness
Try:
echo "sleep 10" | qsub -l nodes=node4:ppn=4
or
echo "sleep 10" | qsub -l nodes=1:ppn=4
Does this change anything?
--Jerry
Edsall, William (WJ) wrote:
I usually test with a STDIN command such as this.
> echo "sleep 10" | qsub -l nodes=1:node4:ppn=4
My job runs, but as you can see I only get one CPU, on
the wrong resource. The same happens when requesting multiple nodes. This
was working, and still works on our other clusters, but as of Monday this week
it fails.
> qstat -f 1059
Job Id: 1059
Job_Name = STDIN
Job_Owner = <deleted>
job_state = R
queue = batch
server = <deleted>com
Checkpoint = u
ctime = Wed May 20 11:46:18 2009
Error_Path = <deleted>
exec_host = node2/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed May 20 11:46:26 2009
Output_Path = <deleted>/STDIN.o1059
Priority = 0
qtime = Wed May 20 11:46:18 2009
Rerunable = True
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 01:00:00
session_id = 12814
substate = 42
Variable_List =
PBS_O_HOME=/home/<deleted>,PBS_O_LANG=POSIX,
PBS_O_LOGNAME=<deleted>,
PBS_O_PATH=/usr/local/torque/sbin:/usr/local/torque/bin:/usr/bin:/bin
:/usr/sbin:/sbin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games
:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin,
PBS_O_MAIL=/var/mail/<deleted>,PBS_O_SHELL=/bin/tcsh,
PBS_SERVER=txmerig.nam.dow.com,PBS_O_HOST=txmerig.nam.dow.com,
PBS_O_WORKDIR=/home/<deleted>,PBS_O_QUEUE=batch
euser = <deleted>
egroup = users
hashname = 1059.<deleted>.com
queue_rank = 996
queue_type = E
comment = Job started on Wed May 20 at 11:46
etime = Wed May 20 11:46:18 2009
submit_args = -l nodes=1:node4:ppn=4
start_time = Wed May 20 11:46:26 2009
start_count = 1
On other known-working clusters, requesting resources in
the same fashion works fine, as seen here:
exec_host = node14/3+node14/2+node14/1+node14/0+node13/3+node13/2+node13/1+node13/0
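The difference between the failing and working allocations is easier to see when an exec_host value is reduced to a slot count per node. A minimal sketch, assuming the node/slot entries are joined with "+" as in the qstat -f output in this thread; count_slots is a hypothetical helper, not a TORQUE command:

```shell
# Hypothetical helper: print "<node> <slot count>" for each node named
# in an exec_host string such as "node14/3+node14/2+node13/0".
count_slots() {
    printf '%s\n' "$1" |
        tr '+' '\n' |      # one node/slot entry per line
        cut -d/ -f1 |      # keep only the node name
        sort | uniq -c |   # count entries per node
        awk '{print $2, $1}'
}

# The failing job from this thread:
count_slots 'node2/0'
# prints: node2 1

# The working cluster:
count_slots 'node14/3+node14/2+node14/1+node14/0+node13/3+node13/2+node13/1+node13/0'
# prints:
#   node13 4
#   node14 4
```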
________________________________
From: Jerry Smith [mailto:jdsmit at sandia.gov]
Sent: Wednesday, May 20, 2009 12:02 PM
To: Edsall, William (WJ)
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] PBS Scheduling Weirdness
Sorry, I forgot to ask this as well: can we get a
copy of the script you are submitting and the qsub command you are
using?
Jerry
Edsall, William (WJ) wrote:
Hello,
Here is the output. I'm using the
Torque scheduler; Maui is on the system but not running.
# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = txmerig
//stripped out the list of managers and operators
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 1054
________________________________
From: Jerry Smith [mailto:jdsmit at sandia.gov]
Sent: Tuesday, May 19, 2009 4:05 PM
To: Edsall, William (WJ)
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] PBS Scheduling Weirdness
Can you give us the output from:
qmgr -c "p s"
and are you using any external
scheduler, Maui or Moab or the like?
Thanks,
--Jerry
Edsall, William (WJ) wrote:
Hello list,
I'm having a strange problem with Torque
version 2.4.0b1.
It seems that no matter how many
resources I request, I only get one CPU on the first available node.
Please help me brainstorm the possible
causes.
_______________________________________
William J. Edsall