[torqueusers] jobs not being scheduled but many free slots

Arnau Bria arnaubria at pic.es
Sat Jan 3 11:22:57 MST 2009


Hi all,

sorry for the "future" cross-posting, but I asked this question on the
mauiusers list and I don't see it in the archives... I tried
subscribing my user again but I still don't see my post...
Maybe someone here can help me.

[...]

I have a strange scenario: I have many queued jobs and many free
host slots. Maui schedules some jobs, but then the first job in the
queue shows this message:

[root@pbs02 ~]# checkjob 1661257


checking job 1661257

State: Idle
Creds:  user:cmprd028  group:cmprd  class:glong  qos:lhccms
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Jan  2 20:26:19
  (Time Queued  Total: 1:38:14  Eligible: 1:20:18)

StartDate: 00:00:01  Fri Jan  2 22:04:34
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [slc4]
Dedicated Resources Per Task: [NONE]
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 42
PartitionMask: [ALL]
Reservation '1661257' (00:00:01 -> 3:00:00:01  Duration: 3:00:00:00)
Messages:  cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable REJHOST=td069.pic.es MSG=cannot allocate node 'td069.pic.es' to job - node not currently available (nps needed/free: 1/0,  joblist: 1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'
PE:  0.00  StartPriority:  44
cannot select job 1661257 for partition DEFAULT (startdate in '00:00:01')


Asking maui:
[root@pbs02 ~]# checknode td069.pic.es


checking node td069.pic.es

State:      Idle  (in current state for 00:04:05)
Configured Resources: PROCS: 8  MEM: 15G  SWAP: 30G  DISK: 103G
Utilized   Resources: DISK: 5238M
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       3.170
Network:    [DEFAULT]
Features:   [slc4][magic]
Attributes: [Batch]
Classes:    [long 8:8][medium 8:8][short 8:8][pps 8:8][test 8:8][ifae 8:8][gshort 8:8][glong 8:8][gmedium 8:8][lhcbsl4 8:8][roman 8:8][magic 8:8]

Total Time:   INFINITY  Up:   INFINITY (94.12%)  Active:   INFINITY (79.20%)

Reservations:
NOTE:  no reservations on node




But asking torque:


[root@pbs02 ~]# pbsnodes td069.pic.es
td069.pic.es
     state = free
     np = 8
     properties = slc4,magic
     ntype = cluster
     status = opsys=linux,uname=Linux td069.pic.es 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=1074,totmem=32637840kb,availmem=32517752kb,physmem=16632008kb,ncpus=8,loadave=1.64,gres=cpu_factor:=1.85625,netload=1580354190,size=102928248kb:108277440kb,state=free,jobs=,varattr=,rectime=1230930523

And if I go to the node:
[root@td069 ~]# ps faux|grep pbs
root     21179  0.0  0.0  4492  628 pts/0    S+   22:10   0:00  \_ grep pbs
root      3900  0.0  0.0  8592 8364 ?        SLs  21:51   0:00 /usr/sbin/pbs_mom

So... torque says the node is empty, maui sees it as full, and no job
is running on the host...
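
In case it helps, these are the extra cross-checks I can run to compare
the three views (I'm assuming momctl and maui's diagnose behave as
documented for these versions):

# what the mom itself thinks is running (level-3 diagnostics include its job list)
momctl -d 3 -h td069.pic.es
# maui's internal node table entry
diagnose -n | grep td069
# jobs the torque server believes are placed on the node
qstat -rn | grep -B1 td069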


The maui log says something like this:

[root@pbs02 ~]# grep td069 /var/log/maui.log
01/02 22:04:33 ERROR:    job '1661257' cannot be started: (rc: 15044  errmsg: 'Resource temporarily unavailable REJHOST=td069.pic.es MSG=cannot allocate node 'td069.pic.es' to job - node not currently available (nps needed/free: 1/0,  joblist: 1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'  hostlist: 'td069.pic.es')
01/02 22:04:33 ERROR:    job '1661258' cannot be started: (rc: 15044  errmsg: 'Resource temporarily unavailable REJHOST=td069.pic.es MSG=cannot allocate node 'td069.pic.es' to job - node not currently available (nps needed/free: 1/0,  joblist: 1661247.pbs02.pic.es:0,1661248.pbs02.pic.es:1,1661249.pbs02.pic.es:2,1661250.pbs02.pic.es:3,1661252.pbs02.pic.es:4,1661253.pbs02.pic.es:5,1661255.pbs02.pic.es:6,1661256.pbs02.pic.es:7)'  hostlist: 'td069.pic.es')


Finally, showq says:
   123 Active Jobs     123 of  773 Processors Active (15.91%)
                        15 of  128 Nodes Active      (11.72%)


If it's relevant, here are the node's torque (pbs_mom) logs. Note the
"Unknown Job Id" rejections for ModifyJob requests on the same jobs
that start a moment later:

01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed, unknown job 1661247.pbs02.pic.es), aux=0, type=ModifyJob, from PBS_Server@pbs02.pic.es
01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed, unknown job 1661248.pbs02.pic.es), aux=0, type=ModifyJob, from PBS_Server@pbs02.pic.es
01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed, unknown job 1661249.pbs02.pic.es), aux=0, type=ModifyJob, from PBS_Server@pbs02.pic.es
01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed, unknown job 1661250.pbs02.pic.es), aux=0, type=ModifyJob, from PBS_Server@pbs02.pic.es
01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed, unknown job 1661252.pbs02.pic.es), aux=0, type=ModifyJob, from PBS_Server@pbs02.pic.es
01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed, unknown job 1661253.pbs02.pic.es), aux=0, type=ModifyJob, from PBS_Server@pbs02.pic.es
01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed, unknown job 1661255.pbs02.pic.es), aux=0, type=ModifyJob, from PBS_Server@pbs02.pic.es
01/02/2009 22:04:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=td069.pic.es MSG=modify job failed, unknown job 1661256.pbs02.pic.es), aux=0, type=ModifyJob, from PBS_Server@pbs02.pic.es
01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1661248.pbs02.pic.es started, pid = 14112
01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1661256.pbs02.pic.es started, pid = 14128
01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1661250.pbs02.pic.es started, pid = 14146
01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1661252.pbs02.pic.es started, pid = 14167
01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1661253.pbs02.pic.es started, pid = 14173
01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1661249.pbs02.pic.es started, pid = 14189
01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1661247.pbs02.pic.es started, pid = 14228
01/02/2009 22:04:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1661255.pbs02.pic.es started, pid = 14242
01/02/2009 22:05:35;0080;   pbs_mom;Job;1661247.pbs02.pic.es;scan_for_terminated: job 1661247.pbs02.pic.es task 1 terminated, sid=14228
01/02/2009 22:05:35;0008;   pbs_mom;Job;1661247.pbs02.pic.es;job was terminated
01/02/2009 22:05:35;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/02/2009 22:05:35;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/02/2009 22:05:35;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:05:35;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
01/02/2009 22:05:35;0080;   pbs_mom;Job;1661247.pbs02.pic.es;obit sent to server
01/02/2009 22:05:35;0080;   pbs_mom;Job;1661247.pbs02.pic.es;removing transient job directory /home/tmp/1661247.pbs02.pic.es
01/02/2009 22:07:20;0080;   pbs_mom;Job;1661249.pbs02.pic.es;scan_for_terminated: job 1661249.pbs02.pic.es task 1 terminated, sid=14189
01/02/2009 22:07:20;0008;   pbs_mom;Job;1661249.pbs02.pic.es;job was terminated
01/02/2009 22:07:20;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/02/2009 22:07:20;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/02/2009 22:07:20;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:20;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
01/02/2009 22:07:20;0080;   pbs_mom;Job;1661249.pbs02.pic.es;obit sent to server
01/02/2009 22:07:20;0080;   pbs_mom;Job;1661249.pbs02.pic.es;removing transient job directory /home/tmp/1661249.pbs02.pic.es
01/02/2009 22:07:21;0080;   pbs_mom;Job;1661255.pbs02.pic.es;scan_for_terminated: job 1661255.pbs02.pic.es task 1 terminated, sid=14242
01/02/2009 22:07:21;0008;   pbs_mom;Job;1661255.pbs02.pic.es;job was terminated
01/02/2009 22:07:21;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/02/2009 22:07:21;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/02/2009 22:07:21;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:21;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
01/02/2009 22:07:21;0080;   pbs_mom;Job;1661255.pbs02.pic.es;obit sent to server
01/02/2009 22:07:21;0080;   pbs_mom;Job;1661255.pbs02.pic.es;removing transient job directory /home/tmp/1661255.pbs02.pic.es
01/02/2009 22:07:23;0080;   pbs_mom;Job;1661256.pbs02.pic.es;scan_for_terminated: job 1661256.pbs02.pic.es task 1 terminated, sid=14128
01/02/2009 22:07:23;0008;   pbs_mom;Job;1661256.pbs02.pic.es;job was terminated
01/02/2009 22:07:23;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/02/2009 22:07:23;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/02/2009 22:07:23;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:24;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
01/02/2009 22:07:24;0080;   pbs_mom;Job;1661256.pbs02.pic.es;obit sent to server
01/02/2009 22:07:24;0080;   pbs_mom;Job;1661248.pbs02.pic.es;scan_for_terminated: job 1661248.pbs02.pic.es task 1 terminated, sid=14112
01/02/2009 22:07:24;0008;   pbs_mom;Job;1661248.pbs02.pic.es;job was terminated
01/02/2009 22:07:24;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/02/2009 22:07:24;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/02/2009 22:07:24;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:24;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
01/02/2009 22:07:24;0080;   pbs_mom;Job;1661248.pbs02.pic.es;obit sent to server
01/02/2009 22:07:24;0080;   pbs_mom;Job;1661256.pbs02.pic.es;removing transient job directory /home/tmp/1661256.pbs02.pic.es
01/02/2009 22:07:24;0080;   pbs_mom;Job;1661248.pbs02.pic.es;removing transient job directory /home/tmp/1661248.pbs02.pic.es
01/02/2009 22:07:25;0080;   pbs_mom;Job;1661250.pbs02.pic.es;scan_for_terminated: job 1661250.pbs02.pic.es task 1 terminated, sid=14146
01/02/2009 22:07:25;0008;   pbs_mom;Job;1661250.pbs02.pic.es;job was terminated
01/02/2009 22:07:25;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/02/2009 22:07:25;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/02/2009 22:07:25;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:25;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
01/02/2009 22:07:25;0080;   pbs_mom;Job;1661250.pbs02.pic.es;obit sent to server
01/02/2009 22:07:26;0080;   pbs_mom;Job;1661250.pbs02.pic.es;removing transient job directory /home/tmp/1661250.pbs02.pic.es
01/02/2009 22:07:30;0080;   pbs_mom;Job;1661252.pbs02.pic.es;scan_for_terminated: job 1661252.pbs02.pic.es task 1 terminated, sid=14167
01/02/2009 22:07:30;0008;   pbs_mom;Job;1661252.pbs02.pic.es;job was terminated
01/02/2009 22:07:30;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/02/2009 22:07:30;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/02/2009 22:07:30;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:07:30;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
01/02/2009 22:07:30;0080;   pbs_mom;Job;1661252.pbs02.pic.es;obit sent to server
01/02/2009 22:07:31;0080;   pbs_mom;Job;1661252.pbs02.pic.es;removing transient job directory /home/tmp/1661252.pbs02.pic.es
01/02/2009 22:08:31;0080;   pbs_mom;Job;1661253.pbs02.pic.es;scan_for_terminated: job 1661253.pbs02.pic.es task 1 terminated, sid=14173
01/02/2009 22:08:31;0008;   pbs_mom;Job;1661253.pbs02.pic.es;job was terminated
01/02/2009 22:08:32;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
01/02/2009 22:08:32;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/02/2009 22:08:32;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/02/2009 22:08:32;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
01/02/2009 22:08:32;0080;   pbs_mom;Job;1661253.pbs02.pic.es;obit sent to server
01/02/2009 22:08:32;0080;   pbs_mom;Job;1661253.pbs02.pic.es;removing transient job directory /home/tmp/1661253.pbs02.pic.es


So, could someone please help me find out why maui stops scheduling
jobs even after a node becomes empty?
And why is maui not aware of the node's REAL status?
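
(I suppose a brute-force workaround would be to restart the daemons so
everything resyncs, something like the commands below (assuming the
init scripts shipped with these RPMs), but I'd rather understand the
root cause first:)

# on pbs02: restart maui so it rebuilds its node/job state from pbs_server
service maui restart
# on td069, if the mom still looks confused about its job list:
service pbs_mom restart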

Oh, and the maui config knows about the torque server:
[root@pbs02 ~]# grep pbs02 /var/spool/maui/maui.cfg
SERVERHOST              pbs02.pic.es
ADMINHOST               pbs02.pic.es
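
I can also post the RM definition and poll interval if that's useful;
they can be pulled out the same way (assuming the default config path):

grep -E 'RMCFG|RMTYPE|RMPOLLINTERVAL' /var/spool/maui/maui.cfg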


Versions:
[root@pbs02 ~]# rpm -qa|grep maui

maui-3.2.6p20-snap.1182974819.8.slc4
maui-client-3.2.6p20-snap.1182974819.8.slc4
maui-server-3.2.6p20-snap.1182974819.8.slc4
[root@pbs02 ~]#
[root@pbs02 ~]# rpm -qa|grep torque
torque-2.3.0-snap.200801151629.2cri.slc4
glite-yaim-torque-server-4.0.1-5
torque-server-2.3.0-snap.200801151629.2cri.slc4
glite-yaim-torque-utils-4.0.2-2
torque-client-2.3.0-snap.200801151629.2cri.slc4

TIA,
Arnau

