[Mauiusers] Why a torque/maui job won't start?
Jim Kusznir
jkusznir at gmail.com
Fri Oct 3 10:17:00 MDT 2008
Hello:
Looking through the job queue on my cluster, I find myself
mystified. I have one job that just won't start, and I can't figure
out why:
[root at aeolus changhun]# qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
4428.aeolus         CMAQ.aug.benz    ramos           21:00:00 R default
4429.aeolus         CMAQ.dec.benz    ramos           32:31:14 R default
4437.aeolus         hsa_xml.sh       changhun               0 Q default
4442.aeolus         for.chem.ga2     sledburg        2095:20: R default
4483.aeolus         mem2Rjob2        wdavis          258:09:4 R default
Job 4437 caught my attention, as it appears it should have started
before 4442 and 4483, both of which want far more resources than it
does. In addition, at this moment I have 1 node available, and each of
my nodes has 8 cores and 8 GB of RAM. The user's job script reads:
[root at aeolus hsa_xml]# more hsa_xml.sh
#PBS -l mem=8gb,nodes=1:ppn=4,walltime=01:00:00
#PBS -m abe
#PBS -M <deleted>
# copy qsub's env to the job
#PBS -V
cd $PBS_O_WORKDIR
mpirun mpi_subdue -limit 100 hsa_xml.g
I'm still not entirely sure what the mem= flag is supposed to set, but
in any case, here's what checkjob says:
[root at aeolus hsa_xml]# checkjob 4437
checking job 4437
State: Idle
Creds: user:changhun group:changhun class:default qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Oct 1 10:21:03
(Time Queued Total: 1:22:49:15 Eligible: 00:00:00)
Total Tasks: 4
Req[0] TaskCount: 4 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 2048M
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
Holds: Batch (hold reason: NoResources)
Messages: cannot create reservation for job '4437' (intital reservation attempt)
PE: 8.97 StartPriority: 540
cannot select job 4437 for partition DEFAULT (job hold active)
From this, it appears it's trying to schedule 4 processes, each with
2 GB of RAM. The worst case is that 3 processes run on one node and
the 4th on another. Those resources have been available several
times since the job was queued. Why won't this job run?
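To spell out the arithmetic behind those numbers, here is a minimal shell sketch. It assumes the per-task figure is simply the mem= request divided by the task count, which matches the "MEM: 2048M" line in the checkjob output but is not something I have confirmed in the Maui source. The function names are my own, and the 7900 MB "usable" figure is a hypothetical stand-in for whatever an 8 GB node actually reports, not a measured value.

```shell
#!/bin/sh
# Sketch only: the helper names and the usable-memory figure are
# illustrative assumptions, not Maui or Torque internals.

# Per-task memory in MB, assuming the mem= request is split evenly.
per_task_mem_mb() {
    echo $(( $1 / $2 ))
}

# How many tasks of a given size (MB) fit in a node's usable memory (MB).
tasks_per_node() {
    echo $(( $1 / $2 ))
}

# Values from the job script: mem=8gb, nodes=1:ppn=4
total_mb=$(( 8 * 1024 ))
tasks=4
per_task=$(per_task_mem_mb "$total_mb" "$tasks")
echo "per-task memory: ${per_task}M"

# Hypothetical usable memory on an 8 GB node (assumed, not measured).
usable_mb=7900
echo "tasks that fit on one node: $(tasks_per_node "$usable_mb" "$per_task")"
```

Under that assumed usable figure only 3 of the 4 tasks fit on a single node, which is where the "3 on one node, the 4th on another" worst case comes from.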
I suspect that if the user removes the mem= limit, the job will run,
but that still leaves the question of why.
--Jim