[torqueusers] Jobs started on only a subset of the requested nodes

Per Lundqvist perl at nsc.liu.se
Wed Mar 22 10:12:03 MST 2006


Hi all,

Since yesterday we have had problems with some jobs running on only a
subset of the nodes they requested. For example, job 16804 requested
nodes=9:ppn=2 but ran on only 5 of those 9 nodes (from
/var/spool/PBS/server_priv/accounting/20060322):

03/22/2006 09:25:36;S;16804.torn;user=auser group=agroup
jobname=run_oa4_couprca queue=workq ctime=1143015930 qtime=1143015930 
etime=1143015930 start=1143015936 
exec_host=n113/1+n113/0+n84/1+n84/0+n81/1+n81/0+n68/1+n68/0+n67/1+n67/0 
Resource_List.neednodes=n113:ppn=2+n84:ppn=2+n81:ppn=2+n68:ppn=2+n67:ppn=2+n129:ppn=2+n128:ppn=2+n127:ppn=2+n126:ppn=2 
Resource_List.nodect=9 Resource_List.nodes=9:ppn=2 
Resource_List.walltime=00:10:00
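
A quick way to spot affected jobs might be to compare the hosts in
exec_host with those in Resource_List.neednodes for each "S" record.
The following is only a minimal sketch, assuming the record is a run of
space-separated key=value fields as shown above; the helper names
nodes_in and missing_nodes are made up for illustration and are not
part of TORQUE:

```python
import re

def nodes_in(spec):
    """Return the set of host names in an exec_host or neednodes string."""
    # Entries are joined by '+'; the host name ends at '/' (exec_host)
    # or at ':' (neednodes).
    return {re.split(r"[/:]", entry)[0] for entry in spec.split("+") if entry}

def missing_nodes(record):
    """Return the requested hosts that do not appear in exec_host."""
    # Assumption: fields are separated by whitespace and look like key=value.
    fields = dict(tok.split("=", 1) for tok in record.split() if "=" in tok)
    requested = nodes_in(fields["Resource_List.neednodes"])
    allocated = nodes_in(fields["exec_host"])
    return sorted(requested - allocated)

# Fields copied from the accounting record for job 16804.
record = (
    "exec_host=n113/1+n113/0+n84/1+n84/0+n81/1+n81/0+n68/1+n68/0+n67/1+n67/0 "
    "Resource_List.neednodes=n113:ppn=2+n84:ppn=2+n81:ppn=2+n68:ppn=2"
    "+n67:ppn=2+n129:ppn=2+n128:ppn=2+n127:ppn=2+n126:ppn=2 "
    "Resource_List.nodect=9"
)

print(missing_nodes(record))  # ['n126', 'n127', 'n128', 'n129']
```

For job 16804 this lists the four requested nodes on which the job was
never started.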

* What all these jobs have in common is that they had to preempt
   another job before they could start (from moab.log):

03/22 09:25:30 INFO:     inadequate feasible tasks found for job 16804:0 (10 < 18)
03/22 09:25:30 INFO:     inadequate nodes found for job 16804:0 (5 < 9)
03/22 09:25:30 MJobSelectPJobList(16804,8,4,FJobList,FNL,PJList,PTCList,PNCList,PTL)
03/22 09:25:30 MRMJobRequeue(16803,JPeer,SC)
03/22 09:25:30 MPBSJobRequeue(16803,R,JPeer,EMsg,SC)
03/22 09:25:30 MPBSJobModify(16803,R,Resource_List,neednodes,NULL,EMsg,SC)
03/22 09:25:36 MRsvDestroy(16803,TRUE,TRUE)
03/22 09:25:36 INFO:     attribute 'PREEMPTEE' set for job 16803
03/22 09:25:36 INFO:     tasks located for job 16804:  18 of 18 required (42 feasible)
03/22 09:25:36 MJobStart(16804,EMsg)
03/22 09:25:36 MJobDistributeTasks(16804,base,TRUE,NodeList,STaskMap,0)
03/22 09:25:36 MAMAllocJReserve(16804,RIndex,EMsg)
03/22 09:25:36 MRMJobStart(16804,EMsg,SC)
03/22 09:25:36 MPBSJobStart(16804,base,EMsg,SC)
03/22 09:25:36 MPBSJobModify(16804,R,Resource_List,neednodes,n113:ppn=2+n84:ppn=2+n81:ppn=2+n68:ppn=2+n67:ppn=2+n129:ppn=2+n128:ppn=2+n127:ppn=2+n126:ppn=2,EMsg,SC)
03/22 09:25:36 MPBSJobModify(16804,R,Resource_List,neednodes,,EMsg,SC)
03/22 09:25:36 INFO:     job '16804' successfully started

* In this case the preemptee, job 16803, was running on both processors
   of nodes n[113-n128]. It was not terminated until 09:29:11, although
   job 16804 had already been started at 09:25:36 (from the pbs_mom log):

03/22/2006 09:20:31;0008;   pbs_mom;Job;16803.torn;JOIN JOB as node 1
03/22/2006 09:25:36;0100;   pbs_mom;Job;16803.torn;kill_job received
03/22/2006 09:25:44;0001;   pbs_mom;Job;TMomFinalizeJob3;job 16803.torn started, pid = 8847
03/22/2006 09:25:44;0008;   pbs_mom;Job;16803.torn;Job Modified at request of PBS_Server at n0
03/22/2006 09:29:11;0008;   pbs_mom;Job;16803.torn;kill_task: killing pid 8945 task 1 with sig 9
03/22/2006 09:29:11;0080;   pbs_mom;Job;16803.torn;scan_for_terminated: job 16803.torn task 1 terminated, sid 8847
03/22/2006 09:29:11;0008;   pbs_mom;Job;16803.torn;Terminated
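
To put a number on that gap, the delay between "kill_job received" and
"Terminated" can be computed from the pbs_mom log lines. This is a rough
sketch only, assuming the semicolon-separated layout seen in the lines
above (timestamp, event code, daemon, object type, object name,
message); the function name kill_delay is hypothetical:

```python
from datetime import datetime

def kill_delay(lines, job_id):
    """Return the time between 'kill_job received' and 'Terminated' for a job."""
    received = terminated = None
    for line in lines:
        # Split into at most six fields so the message text stays intact.
        parts = line.split(";", 5)
        if len(parts) < 6 or parts[4] != job_id:
            continue
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
        message = parts[5].strip()
        if message == "kill_job received" and received is None:
            received = stamp
        elif message == "Terminated":
            terminated = stamp
    if received is None or terminated is None:
        return None
    return terminated - received

# Two of the pbs_mom log lines for job 16803, copied from above.
lines = [
    "03/22/2006 09:25:36;0100;   pbs_mom;Job;16803.torn;kill_job received",
    "03/22/2006 09:29:11;0008;   pbs_mom;Job;16803.torn;Terminated",
]

print(kill_delay(lines, "16803.torn"))  # 0:03:35
```

Here the preemptee stayed alive for 3 minutes 35 seconds after the kill
request, during which job 16804 had already been started.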

* We are using torque-2.1.0p0-snap.200603131337 and
   moab-4.5.0p1-snap.1141259930. Has anybody else experienced anything
   similar?

Best regards,
Per Lundqvist


-- 
Per Lundqvist

National Supercomputer Centre
Linköping University, Sweden

http://www.nsc.liu.se

