[torqueusers] torque/maui assigning jobs to full nodes when other nodes are free
Paul Raines
raines at nmr.mgh.harvard.edu
Thu Jul 12 08:45:16 MDT 2012
As a follow-up: after I used qrun to force a job onto another node, maui
still seems confused and thinks the job is allocated to compute-0-6, as
this output shows:
[root at launchpad ~]# checkjob 1713
checking job 1713
State: Running
Creds: user:award group:award class:p30 qos:DEFAULT
WallTime: 00:06:16 of 4:00:00:00
SubmitTime: Thu Jul 12 09:38:19
(Time Queued Total: 00:57:50 Eligible: 00:00:00)
StartTime: Thu Jul 12 10:36:09
StartDate: -1:03:59 Thu Jul 12 09:38:20
Total Tasks: 4
Req[0] TaskCount: 5 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [nonGPU]
NodeCount: 2
Allocated Nodes:
[compute-0-6:4][compute-0-16:1]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Reservation '1713' (-00:06:10 -> 3:23:53:50 Duration: 4:00:00:00)
Messages: cannot start job - RM failure, rc: 15046, msg: 'Resource
temporarily unavailable REJHOST=compute-0-6 MSG=cannot allocate node
'compute-0-6' to job - node not currently available (nps needed/free: 4/3,
gpus needed/free: 0/0, joblist:
1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)'
PE: 5.00 StartPriority: 103003
[root at launchpad ~]# qstat -n 1713
launchpad.nmr.mgh.harvard.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1713.launchpad.n     award    p30      pbsjob_1420       10808     1   4    --  96:00 R 00:05
   compute-0-16/3+compute-0-16/2+compute-0-16/1+compute-0-16/0
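
To make the mismatch concrete, here is a minimal Python sketch that compares
the two views of job 1713. The input strings are copied from the checkjob and
qstat output above; the helper names (parse_maui_alloc, parse_torque_exec_host,
diff_allocations) are made up for illustration and are not part of maui or
torque.

```python
import re
from collections import Counter


def parse_maui_alloc(text):
    """Parse checkjob's 'Allocated Nodes' field, e.g. '[host-a:4][host-b:1]'."""
    return {host: int(n) for host, n in re.findall(r"\[([^\]:]+):(\d+)\]", text)}


def parse_torque_exec_host(text):
    """Parse qstat -n's host list, e.g. 'host-a/1+host-a/0', into host -> slots."""
    return dict(Counter(entry.split("/")[0] for entry in text.split("+") if entry))


def diff_allocations(maui, torque):
    """Return {host: (maui_tasks, torque_slots)} for hosts where the two disagree."""
    return {
        host: (maui.get(host, 0), torque.get(host, 0))
        for host in sorted(set(maui) | set(torque))
        if maui.get(host, 0) != torque.get(host, 0)
    }


# Values taken from the output for job 1713.
maui_view = parse_maui_alloc("[compute-0-6:4][compute-0-16:1]")
torque_view = parse_torque_exec_host(
    "compute-0-16/3+compute-0-16/2+compute-0-16/1+compute-0-16/0"
)

mismatches = diff_allocations(maui_view, torque_view)
for host, (m, t) in mismatches.items():
    print(f"{host}: maui={m} torque={t}")
```

For job 1713 this flags both hosts: maui counts 4 tasks on compute-0-6 and 1 on
compute-0-16, while torque has all 4 slots on compute-0-16.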
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Thu, 12 Jul 2012 10:39am, Paul Raines wrote:
>
> I just did a total reinstall on our batch cluster, upgrading all nodes
> to CentOS6 and updating to torque-2.5.11 and maui-3.3.1.
>
> I have over 100 nodes and only a few jobs submitted so far, but
> somehow jobs are getting Deferred after being assigned to nodes that
> already have jobs running on them, even though plenty of empty,
> free nodes exist.
>
> ==========================================================
> checking job 1710
>
> State: Idle EState: Deferred
> Creds: user:award group:award class:p30 qos:DEFAULT
> WallTime: 00:00:00 of 4:00:00:00
> SubmitTime: Thu Jul 12 09:38:18
> (Time Queued Total: 00:50:31 Eligible: 00:00:00)
>
> StartDate: -00:50:30 Thu Jul 12 09:38:19
> Total Tasks: 4
>
> Req[0] TaskCount: 4 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [nonGPU]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> job is deferred. Reason: RMFailure (cannot start job - RM failure, rc:
> 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 MSG=cannot
> allocate node 'compute-0-6' to job - node not currently available (nps
> needed/free: 4/3, gpus needed/free: 0/0, joblist:
> 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)')
> Holds: Defer (hold reason: RMFailure)
> PE: 4.00 StartPriority: 103050
> cannot select job 1710 for partition DEFAULT (job hold active)
> ==========================================================
>
> [root at launchpad ~]# pbsnodes -a compute-0-6
> compute-0-6
> state = job-exclusive
> np = 8
> properties = nonGPU
> ntype = cluster
> jobs = 0/1021.launchpad.nmr.mgh.harvard.edu,
> 1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu,
> 3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu,
> 5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu,
> 7/1806.launchpad.nmr.mgh.harvard.edu
> status =
> rectime=1342103360,varattr=,jobs=1021.launchpad.nmr.mgh.harvard.edu
> 1754.launchpad.nmr.mgh.harvard.edu 1806.launchpad.nmr.mgh.harvard.edu
> 1816.launchpad.nmr.mgh.harvard.edu,state=free,netload=65919428331,gres=,loadave=5.39,ncpus=8,physmem=32877888kb,availmem=86083428kb,totmem=99986744kb,idletime=143787,nusers=4,nsessions=5,sessions=4122
> 9023 27009 26961 28966,uname=Linux compute-0-6 2.6.32-220.23.1.el6.x86_64 #1
> SMP Mon Jun 18 18:58:52 BST 2012 x86_64,opsys=linux
> gpus = 0
>
> ==========================================================
>
> All these Deferred jobs are trying to run on compute-0-6
>
> ====================================================
> BLOCKED JOBS----------------
> JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
>
> 1710 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:18
> 1714 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:21
> 1715 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:22
> 1716 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:24
> 1717 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:25
> 1718 award Deferred 4 4:00:00:00 Thu Jul 12 09:38:27
> 1726 tyler Deferred 1 4:00:00:00 Thu Jul 12 09:40:46
> 1761 lzollei Deferred 5 4:00:00:00 Thu Jul 12 09:57:36
> 1764 award Deferred 4 4:00:00:00 Thu Jul 12 09:58:54
> 1777 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:04:18
> 1779 tyler Deferred 1 4:00:00:00 Thu Jul 12 10:04:36
> 1784 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:07:39
> 1791 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:11:00
> 1803 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:17:43
> 1814 lzollei Deferred 5 4:00:00:00 Thu Jul 12 10:21:04
> ====================================================
>
> Some jobs we submit still run on other nodes just fine. It seems
> random which jobs get assigned to compute-0-6 and then deferred.
>
> There are lots of identically configured nodes free. I can force these
> jobs to run on other nodes with qrun by hand, but what is going on?
>
> Here is my maui config, which worked fine in my older setup:
> ==========================================================
> RMPOLLINTERVAL 00:00:30
> SERVERHOST launchpad.nmr.mgh.harvard.edu
> SERVERPORT 40559
> SERVERMODE NORMAL
> ADMINHOST launchpad.nmr.mgh.harvard.edu
> RMCFG[base] TYPE=PBS
> ADMIN1 maui root
> ADMIN3 ALL
> LOGFILE /var/spool/maui/log/maui.log
> LOGFILEMAXSIZE 1000000000
> LOGLEVEL 3
> QUEUETIMEWEIGHT 1
> CLASSWEIGHT 10
> USERCFG[DEFAULT] MAXIPROC=8
> CLASSCFG[default] MAXPROCPERUSER=150
> CLASSCFG[matlab] MAXPROCPERUSER=60
> CLASSCFG[max10] MAXPROCPERUSER=10
> CLASSCFG[max20] MAXPROCPERUSER=20
> CLASSCFG[max50] MAXPROCPERUSER=50
> CLASSCFG[max75] MAXPROCPERUSER=75
> CLASSCFG[max100] MAXPROCPERUSER=100
> CLASSCFG[max200] MAXPROCPERUSER=200
> CLASSCFG[p5] MAXPROCPERUSER=5000
> CLASSCFG[p10] MAXPROCPERUSER=5000
> CLASSCFG[p20] MAXPROCPERUSER=5000
> CLASSCFG[p30] MAXPROCPERUSER=5000
> CLASSCFG[p40] MAXPROCPERUSER=5000
> CLASSCFG[p50] MAXPROCPERUSER=30
> CLASSCFG[p60] MAXPROCPERUSER=20
> CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
> CLASSCFG[GPU] MAXPROCPERUSER=5000
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
> NODEALLOCATIONPOLICY PRIORITY
> NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
> ENFORCERESOURCELIMITS OFF
> ENABLEMULTIREQJOBS TRUE
> ====================================================
>
> There is nothing in the queue configs that would favor any nodes over
> any other.
>
> ---------------------------------------------------------------
> Paul Raines http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street Charlestown, MA 02129 USA
>
>
>
>
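
For reference, the NODECFG[DEFAULT] line in the quoted config can be worked
through by hand. The sketch below evaluates PRIORITYF='PRIORITY + 3 * JOBCOUNT'
for compute-0-6, which the pbsnodes output shows with 4 distinct jobs, and for
two idle nodes whose names are hypothetical. It assumes that
NODEALLOCATIONPOLICY PRIORITY hands out the highest-priority nodes first and
that JOBCOUNT is the number of jobs running on a node. It does not model
whether a node has enough free slots, and it is not a confirmed diagnosis of
the deferrals.

```python
def node_priority(base_priority, job_count, job_weight=3):
    """Evaluate PRIORITYF = PRIORITY + job_weight * JOBCOUNT for one node."""
    return base_priority + job_weight * job_count


BASE_PRIORITY = 1000  # PRIORITY=1000 from NODECFG[DEFAULT]

# Running-job counts per node. compute-0-6 has 4 distinct jobs in the pbsnodes
# output (1021, 1754, 1806, 1816); the two idle nodes are hypothetical.
job_counts = {
    "compute-0-6": 4,
    "idle-node-a": 0,
    "idle-node-b": 0,
}

priorities = {
    node: node_priority(BASE_PRIORITY, count) for node, count in job_counts.items()
}

# Highest priority first, which is the order assumed for allocation.
ranked = sorted(priorities, key=priorities.get, reverse=True)

for node in ranked:
    print(f"{node}: {priorities[node]}")
```

Under those assumptions the busy node scores 1012 against 1000 for an idle
one, so a positive JOBCOUNT weight ranks already-loaded nodes ahead of empty
ones.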