[torqueusers] jobs not beeing scheduled but many free slots

Craig Macdonald craigm at dcs.gla.ac.uk
Sat Jan 3 11:29:06 MST 2009


Hi,

Aside: Often the lists require moderation, and I guess the moderator is 
on holiday ;-)

I think the problem is that while the node is free, the loadavg on the 
node suggests otherwise:

pbsnodes reports
	loadave=1.64

maui reports
      Load: 3.170
(I guess a time difference in your grabs there).

There must be other processes running on the node - either not managed 
by torque, or that havent been killed properly by torque (some people 
use kill epilogue scripts).

You probably want to configure maui parameter NODEALLOCATIONPOLICY - see
http://www.clusterresources.com/products/maui/docs/5.2nodeallocation.shtml

Craig

Arnau Bria wrote:
> Hi all,
>
> sorry for "future" crossposting but I asked this question to maui users
> lists and I don't see it in archives... I tried to subscribe my user
> again but still don't see my post....
> Maybe someone could help me.
>
> [...]
>
> I have a starnge scenario. I have many queued jobs and many free
> host slots. Maui schedules some jobsm, but then, the first job in queue
> shows this message:
>
> [root at pbs02 ~]# checkjob 1661257
>
>
> checking job 1661257
>
> State: Idle
> Creds:  user:cmprd028  group:cmprd  class:glong  qos:lhccms
> WallTime: 00:00:00 of 3:00:00:00
> SubmitTime: Fri Jan  2 20:26:19
>   (Time Queued  Total: 1:38:14  Eligible: 1:20:18)
>
> StartDate: 00:00:01  Fri Jan  2 22:04:34
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [slc4]
> Dedicated Resources Per Task: [NONE]
> NodeCount: 1
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 42
> PartitionMask: [ALL]
> Reservation '1661257' (00:00:01 -> 3:00:00:01  Duration: 3:00:00:00)
> Messages:  cannot start job - RM failure, rc: 15044, msg: 'Resource
> temporarily unavailable REJHOST=td069.pic.es MSG=cannot allocate node
> 'td069.pic.es' to job - node not currently available (nps needed/free:
> 1/0,  joblist:
> 1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'
> PE:  0.00  StartPriority:  44 cannot select job 1661257 for partition
> DEFAULT (startdate in '00:00:01')
>
>
> Asking maui:
> [root at pbs02 ~]# checknode td069.pic.es
>
>
> checking node td069.pic.es
>
> State:      Idle  (in current state for 00:04:05)
> Configured Resources: PROCS: 8  MEM: 15G  SWAP: 30G  DISK: 103G
> Utilized   Resources: DISK: 5238M
> Dedicated  Resources: [NONE]
> Opsys:         linux  Arch:      [NONE]
> Speed:      1.00  Load:       3.170
> Network:    [DEFAULT]
> Features:   [slc4][magic]
> Attributes: [Batch]
> Classes:    [long 8:8][medium 8:8][short 8:8][pps 8:8][test 8:8][ifae
> 8:8][gshort 8:8][glong 8:8][gmedium 8:8][lhcbsl4 8:8][roman 8:8][magic
> 8:8]
>
> Total Time:   INFINITY  Up:   INFINITY (94.12%)  Active:   INFINITY
> (79.20%)
>
> Reservations:
> NOTE:  no reservations on node
>
>
>
>
> but asking torque:
>
>
> [root at pbs02 ~]# pbsnodes td069.pic.es
> td069.pic.es
>      state = free
>      np = 8
>      properties = slc4,magic
>      ntype = cluster
>      status = opsys=linux,uname=Linux td069.pic.es 2.6.9-42.0.3.ELsmp
> #1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=? 0,nsessions=?
> 0,nusers=0,idletime=1074,totmem=32637840kb,availmem=32517752kb,physmem=16632008kb,ncpus=8,loadave=1.64,gres=cpu_factor:=1.85625,netload=1580354190,size=102928248kb:108277440kb,state=free,jobs=,varattr=,rectime=1230930523
>
> and if I go to node:
> [root at td069 ~]# ps faux|grep pbs
> root     21179  0.0  0.0  4492  628 pts/0    S+   22:10   0:00
> \_ grep pbs root      3900  0.0  0.0  8592 8364 ?        SLs  21:51
> 0:00 /usr/sbin/pbs_mom
>
> so... torque says the node is empty, maui seeis it full and no job is
> running in host...
>
>
> maui log says something like:
>
> [root at pbs02 ~]# grep td069 /var/log/maui.log
> 01/02 22:04:33 ERROR:    job '1661257' cannot be started: (rc: 15044
> errmsg: 'Resource temporarily unavailable REJHOST=td069.pic.es
> MSG=cannot allocate node 'td069.pic.es' to job - node not currently
> available (nps needed/free: 1/0,  joblist:
> 1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'
> hostlist: 'td069.pic.es') 01/02 22:04:33 ERROR:    job '1661258' cannot
> be started: (rc: 15044  errmsg: 'Resource temporarily unavailable
> REJHOST=td069.pic.es MSG=cannot allocate node 'td069.pic.es' to job -
> node not currently available (nps needed/free: 1/0,  joblist:
> 1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'
> hostlist: 'td069.pic.es')
>
>
> Finally, showq says:
>    123 Active Jobs     123 of  773 Processors Active (15.91%)
>                         15 of  128 Nodes Active      (11.72%)
>
>
> If it's relevant, nopde's torque logs:
>
> 01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply
> code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed,
> unknown job 1661247.pbs02.pic.es), aux=0, type=ModifyJob, from
> PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> REJHOST=td069.pic.es MSG=modify job failed, unknown job
> 1661248.pbs02.pic.es), aux=0, type=ModifyJob, from
> PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> REJHOST=td069.pic.es MSG=modify job failed, unknown job
> 1661249.pbs02.pic.es), aux=0, type=ModifyJob, from
> PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> REJHOST=td069.pic.es MSG=modify job failed, unknown job
> 1661250.pbs02.pic.es), aux=0, type=ModifyJob, from
> PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> REJHOST=td069.pic.es MSG=modify job failed, unknown job
> 1661252.pbs02.pic.es), aux=0, type=ModifyJob, from
> PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> REJHOST=td069.pic.es MSG=modify job failed, unknown job
> 1661253.pbs02.pic.es), aux=0, type=ModifyJob, from
> PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> REJHOST=td069.pic.es MSG=modify job failed, unknown job
> 1661255.pbs02.pic.es), aux=0, type=ModifyJob, from
> PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
> REJHOST=td069.pic.es MSG=modify job failed, unknown job
> 1661256.pbs02.pic.es), aux=0, type=ModifyJob, from
> PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0001;
> pbs_mom;Job;TMomFinalizeJob3;job 1661248.pbs02.pic.es started, pid =
> 14112 01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job
> 1661256.pbs02.pic.es started, pid = 14128 01/02/2009 22:04:33;0001;
> pbs_mom;Job;TMomFinalizeJob3;job 1661250.pbs02.pic.es started, pid =
> 14146 01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job
> 1661252.pbs02.pic.es started, pid = 14167 01/02/2009 22:04:33;0001;
> pbs_mom;Job;TMomFinalizeJob3;job 1661253.pbs02.pic.es started, pid =
> 14173 01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job
> 1661249.pbs02.pic.es started, pid = 14189 01/02/2009 22:04:33;0001;
> pbs_mom;Job;TMomFinalizeJob3;job 1661247.pbs02.pic.es started, pid =
> 14228 01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job
> 1661255.pbs02.pic.es started, pid = 14242 01/02/2009 22:05:35;0080;
> pbs_mom;Job;1661247.pbs02.pic.es;scan_for_terminated: job
> 1661247.pbs02.pic.es task 1 terminated, sid=14228 01/02/2009
> 22:05:35;0008;   pbs_mom;Job;1661247.pbs02.pic.es;job was terminated
> 01/02/2009 22:05:35;0080;   pbs_mom;Svr;preobit_reply;top of
> preobit_reply 01/02/2009 22:05:35;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
> top of while loop 01/02/2009 22:05:35;0080;
> pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 01/02/2009 22:05:35;0008;   pbs_mom;Job;scan_for_terminated;checking
> job post-processing routine 01/02/2009 22:05:35;0080;
> pbs_mom;Job;1661247.pbs02.pic.es;obit sent to server 01/02/2009
> 22:05:35;0080;   pbs_mom;Job;1661247.pbs02.pic.es;removing transient
> job directory /home/tmp/1661247.pbs02.pic.es 01/02/2009
> 22:07:20;0080;   pbs_mom;Job;1661249.pbs02.pic.es;scan_for_terminated:
> job 1661249.pbs02.pic.es task 1 terminated, sid=14189 01/02/2009
> 22:07:20;0008;   pbs_mom;Job;1661249.pbs02.pic.es;job was terminated
> 01/02/2009 22:07:20;0080;   pbs_mom;Svr;preobit_reply;top of
> preobit_reply 01/02/2009 22:07:20;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
> top of while loop 01/02/2009 22:07:20;0080;
> pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 01/02/2009 22:07:20;0008;   pbs_mom;Job;scan_for_terminated;checking
> job post-processing routine 01/02/2009 22:07:20;0080;
> pbs_mom;Job;1661249.pbs02.pic.es;obit sent to server 01/02/2009
> 22:07:20;0080;   pbs_mom;Job;1661249.pbs02.pic.es;removing transient
> job directory /home/tmp/1661249.pbs02.pic.es 01/02/2009
> 22:07:21;0080;   pbs_mom;Job;1661255.pbs02.pic.es;scan_for_terminated:
> job 1661255.pbs02.pic.es task 1 terminated, sid=14242 01/02/2009
> 22:07:21;0008;   pbs_mom;Job;1661255.pbs02.pic.es;job was terminated
> 01/02/2009 22:07:21;0080;   pbs_mom;Svr;preobit_reply;top of
> preobit_reply 01/02/2009 22:07:21;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
> top of while loop 01/02/2009 22:07:21;0080;
> pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 01/02/2009 22:07:21;0008;   pbs_mom;Job;scan_for_terminated;checking
> job post-processing routine 01/02/2009 22:07:21;0080;
> pbs_mom;Job;1661255.pbs02.pic.es;obit sent to server 01/02/2009
> 22:07:21;0080;   pbs_mom;Job;1661255.pbs02.pic.es;removing transient
> job directory /home/tmp/1661255.pbs02.pic.es 01/02/2009
> 22:07:23;0080;   pbs_mom;Job;1661256.pbs02.pic.es;scan_for_terminated:
> job 1661256.pbs02.pic.es task 1 terminated, sid=14128 01/02/2009
> 22:07:23;0008;   pbs_mom;Job;1661256.pbs02.pic.es;job was terminated
> 01/02/2009 22:07:23;0080;   pbs_mom;Svr;preobit_reply;top of
> preobit_reply 01/02/2009 22:07:23;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
> top of while loop 01/02/2009 22:07:23;0080;
> pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 01/02/2009 22:07:24;0008;   pbs_mom;Job;scan_for_terminated;checking
> job post-processing routine 01/02/2009 22:07:24;0080;
> pbs_mom;Job;1661256.pbs02.pic.es;obit sent to server 01/02/2009
> 22:07:24;0080;   pbs_mom;Job;1661248.pbs02.pic.es;scan_for_terminated:
> job 1661248.pbs02.pic.es task 1 terminated, sid=14112 01/02/2009
> 22:07:24;0008;   pbs_mom;Job;1661248.pbs02.pic.es;job was terminated
> 01/02/2009 22:07:24;0080;   pbs_mom;Svr;preobit_reply;top of
> preobit_reply 01/02/2009 22:07:24;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
> top of while loop 01/02/2009 22:07:24;0080;
> pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 01/02/2009 22:07:24;0008;   pbs_mom;Job;scan_for_terminated;checking
> job post-processing routine 01/02/2009 22:07:24;0080;
> pbs_mom;Job;1661248.pbs02.pic.es;obit sent to server 01/02/2009
> 22:07:24;0080;   pbs_mom;Job;1661256.pbs02.pic.es;removing transient
> job directory /home/tmp/1661256.pbs02.pic.es 01/02/2009
> 22:07:24;0080;   pbs_mom;Job;1661248.pbs02.pic.es;removing transient
> job directory /home/tmp/1661248.pbs02.pic.es 01/02/2009
> 22:07:25;0080;   pbs_mom;Job;1661250.pbs02.pic.es;scan_for_terminated:
> job 1661250.pbs02.pic.es task 1 terminated, sid=14146 01/02/2009
> 22:07:25;0008;   pbs_mom;Job;1661250.pbs02.pic.es;job was terminated
> 01/02/2009 22:07:25;0080;   pbs_mom;Svr;preobit_reply;top of
> preobit_reply 01/02/2009 22:07:25;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
> top of while loop 01/02/2009 22:07:25;0080;
> pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 01/02/2009 22:07:25;0008;   pbs_mom;Job;scan_for_terminated;checking
> job post-processing routine 01/02/2009 22:07:25;0080;
> pbs_mom;Job;1661250.pbs02.pic.es;obit sent to server 01/02/2009
> 22:07:26;0080;   pbs_mom;Job;1661250.pbs02.pic.es;removing transient
> job directory /home/tmp/1661250.pbs02.pic.es 01/02/2009
> 22:07:30;0080;   pbs_mom;Job;1661252.pbs02.pic.es;scan_for_terminated:
> job 1661252.pbs02.pic.es task 1 terminated, sid=14167 01/02/2009
> 22:07:30;0008;   pbs_mom;Job;1661252.pbs02.pic.es;job was terminated
> 01/02/2009 22:07:30;0080;   pbs_mom;Svr;preobit_reply;top of
> preobit_reply 01/02/2009 22:07:30;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
> top of while loop 01/02/2009 22:07:30;0080;
> pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 01/02/2009 22:07:30;0008;   pbs_mom;Job;scan_for_terminated;checking
> job post-processing routine 01/02/2009 22:07:30;0080;
> pbs_mom;Job;1661252.pbs02.pic.es;obit sent to server 01/02/2009
> 22:07:31;0080;   pbs_mom;Job;1661252.pbs02.pic.es;removing transient
> job directory /home/tmp/1661252.pbs02.pic.es 01/02/2009
> 22:08:31;0080;   pbs_mom;Job;1661253.pbs02.pic.es;scan_for_terminated:
> job 1661253.pbs02.pic.es task 1 terminated, sid=14173 01/02/2009
> 22:08:31;0008;   pbs_mom;Job;1661253.pbs02.pic.es;job was terminated
> 01/02/2009 22:08:32;0080;   pbs_mom;Svr;preobit_reply;top of
> preobit_reply 01/02/2009 22:08:32;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
> top of while loop 01/02/2009 22:08:32;0080;
> pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 01/02/2009 22:08:32;0008;   pbs_mom;Job;scan_for_terminated;checking
> job post-processing routine 01/02/2009 22:08:32;0080;
> pbs_mom;Job;1661253.pbs02.pic.es;obit sent to server 01/02/2009
> 22:08:32;0080;   pbs_mom;Job;1661253.pbs02.pic.es;removing transient
> job directory /home/tmp/1661253.pbs02.pic.es
>
>
> So, could someone please help to find the reason why maui is stopping
> scheduling jobs after one node is empty?
> And, why maui is not aware of REAL node status?
>
> Oh, maui conf know torque server:
> [root at pbs02 ~]# grep pbs02 /var/spool/maui/maui.cfg
> SERVERHOST              pbs02.pic.es
> ADMINHOST               pbs02.pic.es
>
>
> versions:
> [root at pbs02 ~]# rpm -qa|grep maui
>
> maui-3.2.6p20-snap.1182974819.8.slc4
> maui-client-3.2.6p20-snap.1182974819.8.slc4
> maui-server-3.2.6p20-snap.1182974819.8.slc4
> [root at pbs02 ~]# 
> [root at pbs02 ~]# rpm -qa|grep torque
> torque-2.3.0-snap.200801151629.2cri.slc4
> glite-yaim-torque-server-4.0.1-5
> torque-server-2.3.0-snap.200801151629.2cri.slc4
> glite-yaim-torque-utils-4.0.2-2
> torque-client-2.3.0-snap.200801151629.2cri.slc4
>
> TIA,
> Arnau
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>   



More information about the torqueusers mailing list