[torqueusers] torque/maui assigning jobs to full nodes when other nodes are free

Paul Raines raines at nmr.mgh.harvard.edu
Thu Jul 12 08:39:41 MDT 2012


I just did a total reinstall of our batch cluster, upgrading all nodes
to CentOS 6 and updating to torque-2.5.11 and maui-3.3.1.

I have over 100 nodes and only a few jobs submitted so far, but
somehow jobs are getting Deferred after being assigned to nodes that
already have jobs running on them, even though plenty of empty
free nodes exist.
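
For reference, I am counting free nodes straight from the pbsnodes
output (just a grep over the standard 'state = ' lines, assuming the
usual output format):

==========================================================
# how many nodes the pbs_server considers free
pbsnodes -a | grep -c 'state = free'
==========================================================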

==========================================================
checking job 1710

State: Idle  EState: Deferred
Creds:  user:award  group:award  class:p30  qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Thu Jul 12 09:38:18
   (Time Queued  Total: 00:50:31  Eligible: 00:00:00)

StartDate: -00:50:30  Thu Jul 12 09:38:19
Total Tasks: 4

Req[0]  TaskCount: 4  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [nonGPU]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 
15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 MSG=cannot 
allocate node 'compute-0-6' to job - node not currently available (nps 
needed/free: 4/3, gpus needed/free: 0/0, joblist: 
1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)')
Holds:    Defer  (hold reason:  RMFailure)
PE:  4.00  StartPriority:  103050
cannot select job 1710 for partition DEFAULT (job hold active)
==========================================================
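
As a side note, I can clear the defer hold so maui retries right away
instead of waiting out the defer time (releasehold is the stock maui
client command, if I remember right):

==========================================================
releasehold 1710
==========================================================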

[root@launchpad ~]# pbsnodes -a compute-0-6
compute-0-6
      state = job-exclusive
      np = 8
      properties = nonGPU
      ntype = cluster
      jobs = 0/1021.launchpad.nmr.mgh.harvard.edu, 
1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu, 
3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu, 
5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu, 
7/1806.launchpad.nmr.mgh.harvard.edu
      status = 
rectime=1342103360,varattr=,jobs=1021.launchpad.nmr.mgh.harvard.edu 
1754.launchpad.nmr.mgh.harvard.edu 1806.launchpad.nmr.mgh.harvard.edu 
1816.launchpad.nmr.mgh.harvard.edu,state=free,netload=65919428331,gres=,loadave=5.39,ncpus=8,physmem=32877888kb,availmem=86083428kb,totmem=99986744kb,idletime=143787,nusers=4,nsessions=5,sessions=4122 
9023 27009 26961 28966,uname=Linux compute-0-6 2.6.32-220.23.1.el6.x86_64 #1 
SMP Mon Jun 18 18:58:52 BST 2012 x86_64,opsys=linux
      gpus = 0

==========================================================
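
Note the server side shows the node job-exclusive with all 8 np slots
assigned, while the mom's own status string says state=free.  For
comparison, the mom can be queried directly with momctl (part of
torque):

==========================================================
momctl -d 3 -h compute-0-6
==========================================================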

All of these Deferred jobs are trying to run on compute-0-6:

====================================================
BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

1710                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:18
1714                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:21
1715                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:22
1716                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:24
1717                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:25
1718                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:27
1726                  tyler   Deferred     1  4:00:00:00  Thu Jul 12 09:40:46
1761                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 09:57:36
1764                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:58:54
1777                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:04:18
1779                  tyler   Deferred     1  4:00:00:00  Thu Jul 12 10:04:36
1784                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:07:39
1791                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:11:00
1803                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:17:43
1814                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:21:04
====================================================
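
A quick loop over checkjob output (grepping the same RMFailure message
shown above) is how I confirmed they are all being rejected by the
same node:

==========================================================
for j in 1710 1714 1715 1716 1717 1718 1726 1761 1764 \
         1777 1779 1784 1791 1803 1814; do
    echo -n "$j: "; checkjob $j | grep -o 'REJHOST=[^ ]*'
done
==========================================================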

Some jobs we submit still run on other nodes just fine.  It seems
random which jobs get assigned to compute-0-6 and then deferred.

There are lots of identically configured free nodes.  I can force these
jobs to run on other nodes by hand with qrun, but what is going on?
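
For example, something like this starts job 1710 on another node
(compute-0-9 here is just a stand-in for any free node):

==========================================================
qrun -H compute-0-9 1710
==========================================================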

Here is my maui config, which worked fine in my older setup:
==========================================================
RMPOLLINTERVAL		00:00:30
SERVERHOST		launchpad.nmr.mgh.harvard.edu
SERVERPORT		40559
SERVERMODE		NORMAL
ADMINHOST		launchpad.nmr.mgh.harvard.edu
RMCFG[base]		TYPE=PBS
ADMIN1                maui root
ADMIN3                ALL
LOGFILE               /var/spool/maui/log/maui.log
LOGFILEMAXSIZE        1000000000
LOGLEVEL              3
QUEUETIMEWEIGHT       1
CLASSWEIGHT           10
USERCFG[DEFAULT] MAXIPROC=8
CLASSCFG[default] MAXPROCPERUSER=150
CLASSCFG[matlab] MAXPROCPERUSER=60
CLASSCFG[max10] MAXPROCPERUSER=10
CLASSCFG[max20] MAXPROCPERUSER=20
CLASSCFG[max50] MAXPROCPERUSER=50
CLASSCFG[max75] MAXPROCPERUSER=75
CLASSCFG[max100] MAXPROCPERUSER=100
CLASSCFG[max200] MAXPROCPERUSER=200
CLASSCFG[p5] MAXPROCPERUSER=5000
CLASSCFG[p10] MAXPROCPERUSER=5000
CLASSCFG[p20] MAXPROCPERUSER=5000
CLASSCFG[p30] MAXPROCPERUSER=5000
CLASSCFG[p40] MAXPROCPERUSER=5000
CLASSCFG[p50] MAXPROCPERUSER=30
CLASSCFG[p60] MAXPROCPERUSER=20
CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
CLASSCFG[GPU] MAXPROCPERUSER=5000
BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST
NODEALLOCATIONPOLICY  PRIORITY
NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
ENFORCERESOURCELIMITS   OFF
ENABLEMULTIREQJOBS TRUE
====================================================
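
Since NODEALLOCATIONPOLICY is PRIORITY with that PRIORITYF expression,
it may also be worth comparing what maui itself thinks each node's
state and load is; diagnose -n is the stock maui command for that:

==========================================================
diagnose -n | grep compute-0-6
==========================================================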

There is nothing in the queue configs that would favor any node over
the others.

---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129	    USA