[Mauiusers] Why a torque/maui job won't start?

Jim Kusznir jkusznir at gmail.com
Fri Oct 3 10:17:00 MDT 2008


Hello:

Looking through the job queue on my cluster, I find myself
mystified. I have one job that just won't start, and I can't figure
out why:

[root@aeolus changhun]# qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
4428.aeolus         CMAQ.aug.benz    ramos           21:00:00 R default
4429.aeolus         CMAQ.dec.benz    ramos           32:31:14 R default
4437.aeolus         hsa_xml.sh       changhun               0 Q default
4442.aeolus         for.chem.ga2     sledburg        2095:20: R default
4483.aeolus         mem2Rjob2        wdavis          258:09:4 R default


Job 4437 caught my attention, as it appears it should have started
before 4442 and 4483, both of which want far more resources than it
does.  In addition, at this moment I have 1 node available, and each of
my nodes has 8 cores and 8GB of RAM.  The user's job script reads:

[root@aeolus hsa_xml]# more hsa_xml.sh
#PBS -l mem=8gb,nodes=1:ppn=4,walltime=01:00:00
#PBS -m abe
#PBS -M <deleted>
# copy qsub's env to the job
#PBS -V

cd $PBS_O_WORKDIR
mpirun mpi_subdue -limit 100 hsa_xml.g

I'm still not entirely sure what the mem= flag is supposed to set, but
in any case, here's what checkjob says:

[root@aeolus hsa_xml]# checkjob 4437


checking job 4437

State: Idle
Creds:  user:changhun  group:changhun  class:default  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Oct  1 10:21:03
  (Time Queued  Total: 1:22:49:15  Eligible: 00:00:00)

Total Tasks: 4

Req[0]  TaskCount: 4  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

Holds:    Batch  (hold reason:  NoResources)
Messages:  cannot create reservation for job '4437' (intital reservation attempt)

PE:  8.97  StartPriority:  540
cannot select job 4437 for partition DEFAULT (job hold active)

From this, it appears it's trying to schedule 4 processes, each with
2 GB of RAM.  Worst case is that 3 processes run on one node and the
4th on another.  That much capacity has been available several times
since the job was queued.  Why won't this job run?
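
The per-task figure in the checkjob output is consistent with the
requested mem=8gb being divided evenly across the 4 tasks.  A minimal
arithmetic sketch, assuming that even split is where the 2048M comes
from (the variable names are illustrative, not part of TORQUE or Maui):

```shell
# Values taken from the job script: mem=8gb, nodes=1:ppn=4
mem_request_mb=$((8 * 1024))
tasks=4

# Assumption: the mem= request is split evenly across the tasks
per_task_mb=$((mem_request_mb / tasks))
echo "per-task memory: ${per_task_mb}M"

# Memory needed on a single node if all 4 tasks are placed there
total_mb=$((per_task_mb * tasks))
echo "total on one node: ${total_mb}M"
```

This prints 2048M per task and 8192M in total, matching the
"MEM: 2048M" line above.  Whether a node that nominally has 8GB
actually offers the full 8192M to the scheduler is something I have
not verified; it may be worth comparing against what the nodes report.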

I suspect that if the user removes the mem= limit, it will run, but
that still leaves the question of why.
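
For reference, a sketch of the resource request with only the mem=
term dropped and everything else left as in the user's script.  This
is an experiment to test the suspicion, not a confirmed fix:

```shell
# Same request as in hsa_xml.sh, minus mem=8gb (untested assumption
# that the job then becomes eligible to start)
#PBS -l nodes=1:ppn=4,walltime=01:00:00
```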

--Jim

