[Mauiusers] Why a torque/maui job won't start?
Greenseid, Joseph M.
Joseph.Greenseid at ngc.com
Fri Oct 3 11:24:37 MDT 2008
From the line "#PBS -l mem=8gb,nodes=1:ppn=4,walltime=01:00:00", the user is saying, "I need one node with four processors and 8 GB of RAM for one hour." If no single node in your cluster can provide that (four cores and 8 GB of RAM), that's why the job is blocked.
There's no way this job can be scheduled as "Worst case is that 3 processes run on one node and the 4th on another," because the user requested only one node.
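To make the single-node point concrete, here is a minimal sketch of the fit test described above. It is only an illustration, not Maui's actual algorithm: the function name node_fits and both example nodes are made up, and whether the nodes in this cluster really report less than 8192 MB to the scheduler is an assumption the thread does not confirm.

```shell
#!/bin/sh
# Hypothetical check, NOT Maui's real logic: can one node satisfy a
# single-node request for req_ppn cores and req_mem_mb of memory?
# Arguments: node_cores node_mem_mb req_ppn req_mem_mb
node_fits() {
    [ "$1" -ge "$3" ] && [ "$2" -ge "$4" ]
}

# Request: nodes=1:ppn=4 with mem=8gb (8192 MB).
# Made-up node with 8 cores and 16384 MB.
if node_fits 8 16384 4 8192; then echo "fits"; else echo "blocked"; fi
# Made-up node with 8 cores that reports a little under 8192 MB.
if node_fits 8 8000 4 8192; then echo "fits"; else echo "blocked"; fi
```

The request has to be met by one node on its own, so spare cores or memory on a second node never enter the test.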
From: mauiusers-bounces at supercluster.org on behalf of Jim Kusznir
Sent: Fri 10/3/2008 12:17 PM
To: Discussion of Rocks Clusters; mauiusers at supercluster.org
Subject: [Mauiusers] Why a torque/maui job won't start?
As I looked through the job queue on my cluster, I'm finding myself
mystified. I have one job that just won't start, and I can't figure
out why.
[root@aeolus changhun]# qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
4428.aeolus         CMAQ.aug.benz    ramos           21:00:00 R default
4429.aeolus         CMAQ.dec.benz    ramos           32:31:14 R default
4437.aeolus         hsa_xml.sh       changhun               0 Q default
4442.aeolus         for.chem.ga2     sledburg        2095:20: R default
4483.aeolus         mem2Rjob2        wdavis          258:09:4 R default
Job 4437 caught my attention, as it appears it should have started
before 4442 and 4483, both of which want way more resources than it
does. In addition, at this moment I have 1 node available, and each of
my nodes has 8 cores and 8 GB of RAM. The user's job script reads:
[root@aeolus hsa_xml]# more hsa_xml.sh
#PBS -l mem=8gb,nodes=1:ppn=4,walltime=01:00:00
#PBS -m abe
#PBS -M <deleted>
# copy qsub's env to the job
mpirun mpi_subdue -limit 100 hsa_xml.g
I'm still not entirely sure what the mem= flag is supposed to set, but
in any case, here's what checkjob says:
[root@aeolus hsa_xml]# checkjob 4437
checking job 4437
Creds: user:changhun group:changhun class:default qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Oct 1 10:21:03
(Time Queued Total: 1:22:49:15 Eligible: 00:00:00)
Total Tasks: 4
Req TaskCount: 4 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 2048M
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
Holds: Batch (hold reason: NoResources)
Messages: cannot create reservation for job '4437' (intital
PE: 8.97 StartPriority: 540
cannot select job 4437 for partition DEFAULT (job hold active)
From this, it appears it's trying to schedule 4 processes, with each
process having 2 GB of RAM. Worst case is that 3 processes run on
one node and the 4th on another. That much capacity has been available
several times since the job was queued. Why won't this job run?
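The "MEM: 2048M" per task in the checkjob output is consistent with the job-wide mem=8gb being divided evenly over the four tasks. A minimal sketch of that arithmetic, assuming an even split (the helper names to_mb and per_task_mb are made up for illustration and are not Torque or Maui commands):

```shell
#!/bin/sh
# Hypothetical helper: convert a PBS-style size such as "8gb" or
# "2048mb" to whole megabytes. Only the gb and mb suffixes are handled.
to_mb() {
    case "$1" in
        *gb) echo $(( ${1%gb} * 1024 )) ;;
        *mb) echo "${1%mb}" ;;
        *)   echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# Split a job-wide mem= request evenly across the given number of tasks.
per_task_mb() {
    total_mb=$(to_mb "$1") || return 1
    echo $(( total_mb / $2 ))
}

per_task_mb 8gb 4   # mem=8gb with ppn=4; prints 2048
```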
I suspect that if the user removes the mem= limit, the job will run, but
that still leaves the question as to "why."
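For reference, the script with the mem= limit dropped would presumably look like the following. This is an untested sketch: only the -l line differs from the original script, and nothing in this thread confirms that the change makes the job start.

```shell
#PBS -l nodes=1:ppn=4,walltime=01:00:00
#PBS -m abe
#PBS -M <deleted>
# copy qsub's env to the job
mpirun mpi_subdue -limit 100 hsa_xml.g
```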