[torqueusers] torque/maui assigning jobs to full nodes when other nodes are free
Paul Raines
raines at nmr.mgh.harvard.edu
Thu Jul 12 08:39:41 MDT 2012
I just did a total reinstall on our batch cluster, upgrading all nodes
to CentOS6 and updating to torque-2.5.11 and maui-3.3.1.
I have over 100 nodes and only a few jobs submitted so far, but
somehow jobs are getting Deferred after being assigned to nodes that
already have jobs running on them, even though plenty of empty
free nodes exist.
==========================================================
checking job 1710
State: Idle EState: Deferred
Creds: user:award group:award class:p30 qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Thu Jul 12 09:38:18
(Time Queued Total: 00:50:31 Eligible: 00:00:00)
StartDate: -00:50:30 Thu Jul 12 09:38:19
Total Tasks: 4
Req[0] TaskCount: 4 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [nonGPU]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc:
15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 MSG=cannot
allocate node 'compute-0-6' to job - node not currently available (nps
needed/free: 4/3, gpus needed/free: 0/0, joblist:
1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)')
Holds: Defer (hold reason: RMFailure)
PE: 4.00 StartPriority: 103050
cannot select job 1710 for partition DEFAULT (job hold active)
==========================================================
[root@launchpad ~]# pbsnodes -a compute-0-6
compute-0-6
state = job-exclusive
np = 8
properties = nonGPU
ntype = cluster
jobs = 0/1021.launchpad.nmr.mgh.harvard.edu,
1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu,
3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu,
5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu,
7/1806.launchpad.nmr.mgh.harvard.edu
status =
rectime=1342103360,varattr=,jobs=1021.launchpad.nmr.mgh.harvard.edu
1754.launchpad.nmr.mgh.harvard.edu 1806.launchpad.nmr.mgh.harvard.edu
1816.launchpad.nmr.mgh.harvard.edu,state=free,netload=65919428331,gres=,loadave=5.39,ncpus=8,physmem=32877888kb,availmem=86083428kb,totmem=99986744kb,idletime=143787,nusers=4,nsessions=5,sessions=4122
9023 27009 26961 28966,uname=Linux compute-0-6 2.6.32-220.23.1.el6.x86_64 #1
SMP Mon Jun 18 18:58:52 BST 2012 x86_64,opsys=linux
gpus = 0
==========================================================
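As a cross-check of the pbsnodes output above, the slot usage can be tallied from the "jobs =" attribute. The following is a minimal Python sketch, not part of torque: the helper name count_used_slots is hypothetical, and it assumes the attribute is a comma-separated list of slot/jobid entries exactly as printed above.

```python
def count_used_slots(jobs_attr: str) -> dict:
    """Map job id -> number of slots it holds, from a pbsnodes 'jobs =' value.

    Assumes entries look like 'SLOT/JOBID' separated by commas, as in the
    output pasted above. This is an illustration, not a torque API.
    """
    counts = {}
    for entry in jobs_attr.split(","):
        entry = entry.strip()
        if not entry:
            continue
        _slot, _, job_id = entry.partition("/")
        counts[job_id] = counts.get(job_id, 0) + 1
    return counts


# 'jobs =' value for compute-0-6, copied from the pbsnodes output above.
JOBS = (
    "0/1021.launchpad.nmr.mgh.harvard.edu, "
    "1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu, "
    "3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu, "
    "5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu, "
    "7/1806.launchpad.nmr.mgh.harvard.edu"
)
NP = 8  # 'np = 8' from the same output

per_job = count_used_slots(JOBS)
used = sum(per_job.values())
free = NP - used

for job_id, slots in per_job.items():
    print(f"{job_id}: {slots} slot(s)")
print(f"used={used} of np={NP}, free={free}")
```

With the values above, every slot on the node is accounted for, which matches the job-exclusive state that pbsnodes reports.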
All of these Deferred jobs are trying to run on compute-0-6:
====================================================
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
1710 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:18
1714 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:21
1715 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:22
1716 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:24
1717 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:25
1718 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:27
1726 tyler Deferred 1 4:00:00:00 Thu Jul 12 09:40:46
1761 lzollei Deferred 5 4:00:00:00 Thu Jul 12 09:57:36
1764 award Deferred 4 4:00:00:00 Thu Jul 12 09:58:54
1777 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:04:18
1779 tyler Deferred 1 4:00:00:00 Thu Jul 12 10:04:36
1784 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:07:39
1791 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:11:00
1803 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:17:43
1814 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:21:04
====================================================
Some jobs we submit still run on other nodes just fine. It seems
random which jobs get assigned to compute-0-6 and then deferred.
There are lots of identically configured nodes free. I can force these
jobs to run on other nodes with qrun by hand, but what is going on?
Here is my maui config, which worked fine in my older setup:
==========================================================
RMPOLLINTERVAL 00:00:30
SERVERHOST launchpad.nmr.mgh.harvard.edu
SERVERPORT 40559
SERVERMODE NORMAL
ADMINHOST launchpad.nmr.mgh.harvard.edu
RMCFG[base] TYPE=PBS
ADMIN1 maui root
ADMIN3 ALL
LOGFILE /var/spool/maui/log/maui.log
LOGFILEMAXSIZE 1000000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
CLASSWEIGHT 10
USERCFG[DEFAULT] MAXIPROC=8
CLASSCFG[default] MAXPROCPERUSER=150
CLASSCFG[matlab] MAXPROCPERUSER=60
CLASSCFG[max10] MAXPROCPERUSER=10
CLASSCFG[max20] MAXPROCPERUSER=20
CLASSCFG[max50] MAXPROCPERUSER=50
CLASSCFG[max75] MAXPROCPERUSER=75
CLASSCFG[max100] MAXPROCPERUSER=100
CLASSCFG[max200] MAXPROCPERUSER=200
CLASSCFG[p5] MAXPROCPERUSER=5000
CLASSCFG[p10] MAXPROCPERUSER=5000
CLASSCFG[p20] MAXPROCPERUSER=5000
CLASSCFG[p30] MAXPROCPERUSER=5000
CLASSCFG[p40] MAXPROCPERUSER=5000
CLASSCFG[p50] MAXPROCPERUSER=30
CLASSCFG[p60] MAXPROCPERUSER=20
CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
CLASSCFG[GPU] MAXPROCPERUSER=5000
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
ENFORCERESOURCELIMITS OFF
ENABLEMULTIREQJOBS TRUE
====================================================
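The NODECFG[DEFAULT] line in the config above can be read as plain arithmetic. Below is a minimal Python sketch that evaluates the expression PRIORITY + 3 * JOBCOUNT for two nodes. It is only an illustration under stated assumptions: the job count of 4 for compute-0-6 is taken from the pbsnodes status shown earlier, the idle node name is hypothetical, and the idea that Maui prefers the highest-scoring node under NODEALLOCATIONPOLICY PRIORITY is an assumption based on the Maui documentation, not something verified on this cluster.

```python
def node_priority(priority: int, jobcount: int) -> int:
    """Evaluate PRIORITYF='PRIORITY + 3 * JOBCOUNT' from the config above.

    This mirrors the expression only; it is not a model of Maui's scheduler.
    """
    return priority + 3 * jobcount


BASE_PRIORITY = 1000  # PRIORITY=1000 from NODECFG[DEFAULT]

# Job counts per node. compute-0-6 lists 4 distinct jobs in its pbsnodes
# status; 'hypothetical-idle-node' is a made-up name for an empty node.
job_counts = {
    "compute-0-6": 4,
    "hypothetical-idle-node": 0,
}

scores = {name: node_priority(BASE_PRIORITY, jc) for name, jc in job_counts.items()}

# Assumption: a higher score means the node is preferred for allocation.
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]}")
```

Under that assumption, the expression as written scores a node that is already running jobs above an empty one, so the sign of the JOBCOUNT term may be worth checking against how the old setup behaved.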
There is nothing in the queue configs that would favor any nodes over
any other.
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA