[torqueusers] jobs not beeing scheduled but many free slots
Arnau Bria
arnaubria at pic.es
Sat Jan 3 11:22:57 MST 2009
Hi all,
sorry for "future" crossposting but I asked this question to maui users
lists and I don't see it in archives... I tried to subscribe my user
again but still don't see my post....
Maybe someone could help me.
[...]
I have a starnge scenario. I have many queued jobs and many free
host slots. Maui schedules some jobsm, but then, the first job in queue
shows this message:
[root at pbs02 ~]# checkjob 1661257
checking job 1661257
State: Idle
Creds: user:cmprd028 group:cmprd class:glong qos:lhccms
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Jan 2 20:26:19
(Time Queued Total: 1:38:14 Eligible: 1:20:18)
StartDate: 00:00:01 Fri Jan 2 22:04:34
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [slc4]
Dedicated Resources Per Task: [NONE]
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 42
PartitionMask: [ALL]
Reservation '1661257' (00:00:01 -> 3:00:00:01 Duration: 3:00:00:00)
Messages: cannot start job - RM failure, rc: 15044, msg: 'Resource
temporarily unavailable REJHOST=td069.pic.es MSG=cannot allocate node
'td069.pic.es' to job - node not currently available (nps needed/free:
1/0, joblist:
1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'
PE: 0.00 StartPriority: 44 cannot select job 1661257 for partition
DEFAULT (startdate in '00:00:01')
Asking maui:
[root at pbs02 ~]# checknode td069.pic.es
checking node td069.pic.es
State: Idle (in current state for 00:04:05)
Configured Resources: PROCS: 8 MEM: 15G SWAP: 30G DISK: 103G
Utilized Resources: DISK: 5238M
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 3.170
Network: [DEFAULT]
Features: [slc4][magic]
Attributes: [Batch]
Classes: [long 8:8][medium 8:8][short 8:8][pps 8:8][test 8:8][ifae
8:8][gshort 8:8][glong 8:8][gmedium 8:8][lhcbsl4 8:8][roman 8:8][magic
8:8]
Total Time: INFINITY Up: INFINITY (94.12%) Active: INFINITY
(79.20%)
Reservations:
NOTE: no reservations on node
but asking torque:
[root at pbs02 ~]# pbsnodes td069.pic.es
td069.pic.es
state = free
np = 8
properties = slc4,magic
ntype = cluster
status = opsys=linux,uname=Linux td069.pic.es 2.6.9-42.0.3.ELsmp
#1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=? 0,nsessions=?
0,nusers=0,idletime=1074,totmem=32637840kb,availmem=32517752kb,physmem=16632008kb,ncpus=8,loadave=1.64,gres=cpu_factor:=1.85625,netload=1580354190,size=102928248kb:108277440kb,state=free,jobs=,varattr=,rectime=1230930523
and if I go to node:
[root at td069 ~]# ps faux|grep pbs
root 21179 0.0 0.0 4492 628 pts/0 S+ 22:10 0:00
\_ grep pbs root 3900 0.0 0.0 8592 8364 ? SLs 21:51
0:00 /usr/sbin/pbs_mom
so... torque says the node is empty, maui seeis it full and no job is
running in host...
maui log says something like:
[root at pbs02 ~]# grep td069 /var/log/maui.log
01/02 22:04:33 ERROR: job '1661257' cannot be started: (rc: 15044
errmsg: 'Resource temporarily unavailable REJHOST=td069.pic.es
MSG=cannot allocate node 'td069.pic.es' to job - node not currently
available (nps needed/free: 1/0, joblist:
1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'
hostlist: 'td069.pic.es') 01/02 22:04:33 ERROR: job '1661258' cannot
be started: (rc: 15044 errmsg: 'Resource temporarily unavailable
REJHOST=td069.pic.es MSG=cannot allocate node 'td069.pic.es' to job -
node not currently available (nps needed/free: 1/0, joblist:
1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'
hostlist: 'td069.pic.es')
Finally, showq says:
123 Active Jobs 123 of 773 Processors Active (15.91%)
15 of 128 Nodes Active (11.72%)
If it's relevant, nopde's torque logs:
01/02/2009 22:04:33;0080; pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed,
unknown job 1661247.pbs02.pic.es), aux=0, type=ModifyJob, from
PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=td069.pic.es MSG=modify job failed, unknown job
1661248.pbs02.pic.es), aux=0, type=ModifyJob, from
PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=td069.pic.es MSG=modify job failed, unknown job
1661249.pbs02.pic.es), aux=0, type=ModifyJob, from
PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=td069.pic.es MSG=modify job failed, unknown job
1661250.pbs02.pic.es), aux=0, type=ModifyJob, from
PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=td069.pic.es MSG=modify job failed, unknown job
1661252.pbs02.pic.es), aux=0, type=ModifyJob, from
PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=td069.pic.es MSG=modify job failed, unknown job
1661253.pbs02.pic.es), aux=0, type=ModifyJob, from
PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=td069.pic.es MSG=modify job failed, unknown job
1661255.pbs02.pic.es), aux=0, type=ModifyJob, from
PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0080;
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id
REJHOST=td069.pic.es MSG=modify job failed, unknown job
1661256.pbs02.pic.es), aux=0, type=ModifyJob, from
PBS_Server at pbs02.pic.es 01/02/2009 22:04:33;0001;
pbs_mom;Job;TMomFinalizeJob3;job 1661248.pbs02.pic.es started, pid =
14112 01/02/2009 22:04:33;0001; pbs_mom;Job;TMomFinalizeJob3;job
1661256.pbs02.pic.es started, pid = 14128 01/02/2009 22:04:33;0001;
pbs_mom;Job;TMomFinalizeJob3;job 1661250.pbs02.pic.es started, pid =
14146 01/02/2009 22:04:33;0001; pbs_mom;Job;TMomFinalizeJob3;job
1661252.pbs02.pic.es started, pid = 14167 01/02/2009 22:04:33;0001;
pbs_mom;Job;TMomFinalizeJob3;job 1661253.pbs02.pic.es started, pid =
14173 01/02/2009 22:04:33;0001; pbs_mom;Job;TMomFinalizeJob3;job
1661249.pbs02.pic.es started, pid = 14189 01/02/2009 22:04:33;0001;
pbs_mom;Job;TMomFinalizeJob3;job 1661247.pbs02.pic.es started, pid =
14228 01/02/2009 22:04:33;0001; pbs_mom;Job;TMomFinalizeJob3;job
1661255.pbs02.pic.es started, pid = 14242 01/02/2009 22:05:35;0080;
pbs_mom;Job;1661247.pbs02.pic.es;scan_for_terminated: job
1661247.pbs02.pic.es task 1 terminated, sid=14228 01/02/2009
22:05:35;0008; pbs_mom;Job;1661247.pbs02.pic.es;job was terminated
01/02/2009 22:05:35;0080; pbs_mom;Svr;preobit_reply;top of
preobit_reply 01/02/2009 22:05:35;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
top of while loop 01/02/2009 22:05:35;0080;
pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:05:35;0008; pbs_mom;Job;scan_for_terminated;checking
job post-processing routine 01/02/2009 22:05:35;0080;
pbs_mom;Job;1661247.pbs02.pic.es;obit sent to server 01/02/2009
22:05:35;0080; pbs_mom;Job;1661247.pbs02.pic.es;removing transient
job directory /home/tmp/1661247.pbs02.pic.es 01/02/2009
22:07:20;0080; pbs_mom;Job;1661249.pbs02.pic.es;scan_for_terminated:
job 1661249.pbs02.pic.es task 1 terminated, sid=14189 01/02/2009
22:07:20;0008; pbs_mom;Job;1661249.pbs02.pic.es;job was terminated
01/02/2009 22:07:20;0080; pbs_mom;Svr;preobit_reply;top of
preobit_reply 01/02/2009 22:07:20;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
top of while loop 01/02/2009 22:07:20;0080;
pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:20;0008; pbs_mom;Job;scan_for_terminated;checking
job post-processing routine 01/02/2009 22:07:20;0080;
pbs_mom;Job;1661249.pbs02.pic.es;obit sent to server 01/02/2009
22:07:20;0080; pbs_mom;Job;1661249.pbs02.pic.es;removing transient
job directory /home/tmp/1661249.pbs02.pic.es 01/02/2009
22:07:21;0080; pbs_mom;Job;1661255.pbs02.pic.es;scan_for_terminated:
job 1661255.pbs02.pic.es task 1 terminated, sid=14242 01/02/2009
22:07:21;0008; pbs_mom;Job;1661255.pbs02.pic.es;job was terminated
01/02/2009 22:07:21;0080; pbs_mom;Svr;preobit_reply;top of
preobit_reply 01/02/2009 22:07:21;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
top of while loop 01/02/2009 22:07:21;0080;
pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:21;0008; pbs_mom;Job;scan_for_terminated;checking
job post-processing routine 01/02/2009 22:07:21;0080;
pbs_mom;Job;1661255.pbs02.pic.es;obit sent to server 01/02/2009
22:07:21;0080; pbs_mom;Job;1661255.pbs02.pic.es;removing transient
job directory /home/tmp/1661255.pbs02.pic.es 01/02/2009
22:07:23;0080; pbs_mom;Job;1661256.pbs02.pic.es;scan_for_terminated:
job 1661256.pbs02.pic.es task 1 terminated, sid=14128 01/02/2009
22:07:23;0008; pbs_mom;Job;1661256.pbs02.pic.es;job was terminated
01/02/2009 22:07:23;0080; pbs_mom;Svr;preobit_reply;top of
preobit_reply 01/02/2009 22:07:23;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
top of while loop 01/02/2009 22:07:23;0080;
pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:24;0008; pbs_mom;Job;scan_for_terminated;checking
job post-processing routine 01/02/2009 22:07:24;0080;
pbs_mom;Job;1661256.pbs02.pic.es;obit sent to server 01/02/2009
22:07:24;0080; pbs_mom;Job;1661248.pbs02.pic.es;scan_for_terminated:
job 1661248.pbs02.pic.es task 1 terminated, sid=14112 01/02/2009
22:07:24;0008; pbs_mom;Job;1661248.pbs02.pic.es;job was terminated
01/02/2009 22:07:24;0080; pbs_mom;Svr;preobit_reply;top of
preobit_reply 01/02/2009 22:07:24;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
top of while loop 01/02/2009 22:07:24;0080;
pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:24;0008; pbs_mom;Job;scan_for_terminated;checking
job post-processing routine 01/02/2009 22:07:24;0080;
pbs_mom;Job;1661248.pbs02.pic.es;obit sent to server 01/02/2009
22:07:24;0080; pbs_mom;Job;1661256.pbs02.pic.es;removing transient
job directory /home/tmp/1661256.pbs02.pic.es 01/02/2009
22:07:24;0080; pbs_mom;Job;1661248.pbs02.pic.es;removing transient
job directory /home/tmp/1661248.pbs02.pic.es 01/02/2009
22:07:25;0080; pbs_mom;Job;1661250.pbs02.pic.es;scan_for_terminated:
job 1661250.pbs02.pic.es task 1 terminated, sid=14146 01/02/2009
22:07:25;0008; pbs_mom;Job;1661250.pbs02.pic.es;job was terminated
01/02/2009 22:07:25;0080; pbs_mom;Svr;preobit_reply;top of
preobit_reply 01/02/2009 22:07:25;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
top of while loop 01/02/2009 22:07:25;0080;
pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:25;0008; pbs_mom;Job;scan_for_terminated;checking
job post-processing routine 01/02/2009 22:07:25;0080;
pbs_mom;Job;1661250.pbs02.pic.es;obit sent to server 01/02/2009
22:07:26;0080; pbs_mom;Job;1661250.pbs02.pic.es;removing transient
job directory /home/tmp/1661250.pbs02.pic.es 01/02/2009
22:07:30;0080; pbs_mom;Job;1661252.pbs02.pic.es;scan_for_terminated:
job 1661252.pbs02.pic.es task 1 terminated, sid=14167 01/02/2009
22:07:30;0008; pbs_mom;Job;1661252.pbs02.pic.es;job was terminated
01/02/2009 22:07:30;0080; pbs_mom;Svr;preobit_reply;top of
preobit_reply 01/02/2009 22:07:30;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
top of while loop 01/02/2009 22:07:30;0080;
pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:30;0008; pbs_mom;Job;scan_for_terminated;checking
job post-processing routine 01/02/2009 22:07:30;0080;
pbs_mom;Job;1661252.pbs02.pic.es;obit sent to server 01/02/2009
22:07:31;0080; pbs_mom;Job;1661252.pbs02.pic.es;removing transient
job directory /home/tmp/1661252.pbs02.pic.es 01/02/2009
22:08:31;0080; pbs_mom;Job;1661253.pbs02.pic.es;scan_for_terminated:
job 1661253.pbs02.pic.es task 1 terminated, sid=14173 01/02/2009
22:08:31;0008; pbs_mom;Job;1661253.pbs02.pic.es;job was terminated
01/02/2009 22:08:32;0080; pbs_mom;Svr;preobit_reply;top of
preobit_reply 01/02/2009 22:08:32;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
top of while loop 01/02/2009 22:08:32;0080;
pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:08:32;0008; pbs_mom;Job;scan_for_terminated;checking
job post-processing routine 01/02/2009 22:08:32;0080;
pbs_mom;Job;1661253.pbs02.pic.es;obit sent to server 01/02/2009
22:08:32;0080; pbs_mom;Job;1661253.pbs02.pic.es;removing transient
job directory /home/tmp/1661253.pbs02.pic.es
So, could someone please help to find the reason why maui is stopping
scheduling jobs after one node is empty?
And, why maui is not aware of REAL node status?
Oh, maui conf know torque server:
[root at pbs02 ~]# grep pbs02 /var/spool/maui/maui.cfg
SERVERHOST pbs02.pic.es
ADMINHOST pbs02.pic.es
versions:
[root at pbs02 ~]# rpm -qa|grep maui
maui-3.2.6p20-snap.1182974819.8.slc4
maui-client-3.2.6p20-snap.1182974819.8.slc4
maui-server-3.2.6p20-snap.1182974819.8.slc4
[root at pbs02 ~]#
[root at pbs02 ~]# rpm -qa|grep torque
torque-2.3.0-snap.200801151629.2cri.slc4
glite-yaim-torque-server-4.0.1-5
torque-server-2.3.0-snap.200801151629.2cri.slc4
glite-yaim-torque-utils-4.0.2-2
torque-client-2.3.0-snap.200801151629.2cri.slc4
TIA,
Arnau
More information about the torqueusers
mailing list