[torqueusers] torque/maui assigning jobs to full nodes when other nodes are free

Paul Raines raines at nmr.mgh.harvard.edu
Thu Jul 12 08:45:16 MDT 2012


As a follow-up: after running qrun on the job to force it onto another node,
maui still seems to think the job is allocated to compute-0-6, as this
output shows:

[root@launchpad ~]# checkjob 1713


checking job 1713

State: Running
Creds:  user:award  group:award  class:p30  qos:DEFAULT
WallTime: 00:06:16 of 4:00:00:00
SubmitTime: Thu Jul 12 09:38:19
   (Time Queued  Total: 00:57:50  Eligible: 00:00:00)

StartTime: Thu Jul 12 10:36:09
StartDate: -1:03:59  Thu Jul 12 09:38:20
Total Tasks: 4

Req[0]  TaskCount: 5  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [nonGPU]
NodeCount: 2
Allocated Nodes:
[compute-0-6:4][compute-0-16:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Reservation '1713' (-00:06:10 -> 3:23:53:50  Duration: 4:00:00:00)
Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource 
temporarily unavailable REJHOST=compute-0-6 MSG=cannot allocate node 
'compute-0-6' to job - node not currently available (nps needed/free: 4/3, 
gpus needed/free: 0/0, joblist: 
1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)'
PE:  5.00  StartPriority:  103003

[root@launchpad ~]# qstat -n 1713

launchpad.nmr.mgh.harvard.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK  Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- ---- ------ ----- - -----
1713.launchpad.n     award    p30      pbsjob_1420       10808      1    4    --  96:00 R 00:05
    compute-0-16/3+compute-0-16/2+compute-0-16/1+compute-0-16/0
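For reference, the rejection arithmetic in the checkjob message above can be
reproduced by counting the slots in its joblist: the message names five slots
held by job 1021 on an np=8 node, leaving 3 free, one short of the 4 requested.
A small sketch of that count (joblist copied from the error message; np=8
taken from the pbsnodes output quoted below):

```python
# Reproduce Torque's "nps needed/free: 4/3" arithmetic from the RM
# failure message.  Each "<jobid>:<slot>" entry occupies one processor.
joblist = ("1021.launchpad.nmr.mgh.harvard.edu:0,"
           "1021.launchpad.nmr.mgh.harvard.edu:1,"
           "1021.launchpad.nmr.mgh.harvard.edu:2,"
           "1021.launchpad.nmr.mgh.harvard.edu:3,"
           "1021.launchpad.nmr.mgh.harvard.edu:4")

np = 8                            # processors on compute-0-6 (from pbsnodes)
needed = 4                        # processors the job requested
used = len(joblist.split(","))    # 5 slots held by job 1021
free = np - used                  # 3 free, fewer than the 4 needed
print(f"nps needed/free: {needed}/{free}")
```

Note the joblist in the error names only job 1021; pbsnodes shows three more
jobs on the node, so by either count the node cannot supply 4 free processors.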


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Thu, 12 Jul 2012 10:39am, Paul Raines wrote:

>
> I just did a total reinstall of our batch cluster, upgrading all nodes
> to CentOS6 and updating to torque-2.5.11 and maui-3.3.1.
>
> I have over 100 nodes and only a few jobs submitted so far, but
> somehow jobs are getting Deferred after being assigned to nodes that
> already have jobs running on them, even though plenty of empty
> free nodes exist.
>
> ==========================================================
> checking job 1710
>
> State: Idle  EState: Deferred
> Creds:  user:award  group:award  class:p30  qos:DEFAULT
> WallTime: 00:00:00 of 4:00:00:00
> SubmitTime: Thu Jul 12 09:38:18
>  (Time Queued  Total: 00:50:31  Eligible: 00:00:00)
>
> StartDate: -00:50:30  Thu Jul 12 09:38:19
> Total Tasks: 4
>
> Req[0]  TaskCount: 4  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [nonGPU]
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 
> 15046, msg: 'Resource temporarily unavailable REJHOST=compute-0-6 MSG=cannot 
> allocate node 'compute-0-6' to job - node not currently available (nps 
> needed/free: 4/3, gpus needed/free: 0/0, joblist: 
> 1021.launchpad.nmr.mgh.harvard.edu:0,1021.launchpad.nmr.mgh.harvard.edu:1,1021.launchpad.nmr.mgh.harvard.edu:2,1021.launchpad.nmr.mgh.harvard.edu:3,1021.launchpad.nmr.mgh.harvard.edu:4)')
> Holds:    Defer  (hold reason:  RMFailure)
> PE:  4.00  StartPriority:  103050
> cannot select job 1710 for partition DEFAULT (job hold active)
> ==========================================================
>
> [root@launchpad ~]# pbsnodes -a compute-0-6
> compute-0-6
>     state = job-exclusive
>     np = 8
>     properties = nonGPU
>     ntype = cluster
>     jobs = 0/1021.launchpad.nmr.mgh.harvard.edu, 
> 1/1021.launchpad.nmr.mgh.harvard.edu, 2/1021.launchpad.nmr.mgh.harvard.edu, 
> 3/1021.launchpad.nmr.mgh.harvard.edu, 4/1021.launchpad.nmr.mgh.harvard.edu, 
> 5/1754.launchpad.nmr.mgh.harvard.edu, 6/1816.launchpad.nmr.mgh.harvard.edu, 
> 7/1806.launchpad.nmr.mgh.harvard.edu
>     status = 
> rectime=1342103360,varattr=,jobs=1021.launchpad.nmr.mgh.harvard.edu 
> 1754.launchpad.nmr.mgh.harvard.edu 1806.launchpad.nmr.mgh.harvard.edu 
> 1816.launchpad.nmr.mgh.harvard.edu,state=free,netload=65919428331,gres=,loadave=5.39,ncpus=8,physmem=32877888kb,availmem=86083428kb,totmem=99986744kb,idletime=143787,nusers=4,nsessions=5,sessions=4122 
> 9023 27009 26961 28966,uname=Linux compute-0-6 2.6.32-220.23.1.el6.x86_64 #1 
> SMP Mon Jun 18 18:58:52 BST 2012 x86_64,opsys=linux
>     gpus = 0
>
> ==========================================================
>
> All these Deferred jobs are trying to run on compute-0-6
>
> ====================================================
> BLOCKED JOBS----------------
> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>
> 1710                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:18
> 1714                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:21
> 1715                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:22
> 1716                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:24
> 1717                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:25
> 1718                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:38:27
> 1726                  tyler   Deferred     1  4:00:00:00  Thu Jul 12 09:40:46
> 1761                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 09:57:36
> 1764                  award   Deferred     4  4:00:00:00  Thu Jul 12 09:58:54
> 1777                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:04:18
> 1779                  tyler   Deferred     1  4:00:00:00  Thu Jul 12 10:04:36
> 1784                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:07:39
> 1791                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:11:00
> 1803                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:17:43
> 1814                lzollei   Deferred     5  4:00:00:00  Thu Jul 12 10:21:04
> ====================================================
>
> Some jobs we submit still get run on other nodes just fine.  It seems
> random what is getting assigned to compute-0-6 and then deferred.
>
> There are plenty of identically configured nodes free.  I can force these
> jobs to run on other nodes by hand with qrun, but what is going on?
>
> Here is my maui config, which worked fine in my older setup:
> ==========================================================
> RMPOLLINTERVAL		00:00:30
> SERVERHOST		launchpad.nmr.mgh.harvard.edu
> SERVERPORT		40559
> SERVERMODE		NORMAL
> ADMINHOST		launchpad.nmr.mgh.harvard.edu
> RMCFG[base]		TYPE=PBS
> ADMIN1                maui root
> ADMIN3                ALL
> LOGFILE               /var/spool/maui/log/maui.log
> LOGFILEMAXSIZE        1000000000
> LOGLEVEL              3
> QUEUETIMEWEIGHT       1
> CLASSWEIGHT           10
> USERCFG[DEFAULT] MAXIPROC=8
> CLASSCFG[default] MAXPROCPERUSER=150
> CLASSCFG[matlab] MAXPROCPERUSER=60
> CLASSCFG[max10] MAXPROCPERUSER=10
> CLASSCFG[max20] MAXPROCPERUSER=20
> CLASSCFG[max50] MAXPROCPERUSER=50
> CLASSCFG[max75] MAXPROCPERUSER=75
> CLASSCFG[max100] MAXPROCPERUSER=100
> CLASSCFG[max200] MAXPROCPERUSER=200
> CLASSCFG[p5] MAXPROCPERUSER=5000
> CLASSCFG[p10] MAXPROCPERUSER=5000
> CLASSCFG[p20] MAXPROCPERUSER=5000
> CLASSCFG[p30] MAXPROCPERUSER=5000
> CLASSCFG[p40] MAXPROCPERUSER=5000
> CLASSCFG[p50] MAXPROCPERUSER=30
> CLASSCFG[p60] MAXPROCPERUSER=20
> CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
> CLASSCFG[GPU] MAXPROCPERUSER=5000
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
> NODEALLOCATIONPOLICY  PRIORITY
> NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
> ENFORCERESOURCELIMITS   OFF
> ENABLEMULTIREQJOBS TRUE
> ====================================================
>
> There is nothing in the queue configs that would favor any nodes over
> any other.
>
> ---------------------------------------------------------------
> Paul Raines                     http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street     Charlestown, MA 02129	    USA
>
>
>
>
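One thing worth noting about the config quoted above: under
NODEALLOCATIONPOLICY PRIORITY, Maui prefers the highest-priority node, and
NODECFG[DEFAULT] PRIORITYF='PRIORITY + 3 * JOBCOUNT' makes a node's priority
grow with its job count. If JOBCOUNT counts jobs already on the node (as its
name suggests), busy nodes outrank idle ones and jobs get packed onto
already-loaded nodes such as compute-0-6. A minimal sketch of that ranking
(node names and job counts here are illustrative, not live scheduler state):

```python
# Sketch of how Maui's PRIORITY allocation would rank nodes under the
# PRIORITYF from the config above.  Job counts are hypothetical.
def node_priority(base_priority, jobcount):
    # NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
    return base_priority + 3 * jobcount

jobcounts = {
    "compute-0-6": 5,    # node already running several jobs (hypothetical)
    "compute-0-20": 0,   # idle node (hypothetical)
}

# PRIORITY allocation picks the highest-scoring node first.
ranked = sorted(jobcounts, key=lambda n: node_priority(1000, jobcounts[n]),
                reverse=True)
print(ranked)
```

With a minus sign instead ('PRIORITY - 3 * JOBCOUNT') the ranking would
spread jobs across idle nodes rather than packing them.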




