[Mauiusers] Why a torque/maui job won't start?

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Fri Oct 3 11:24:37 MDT 2008


From the line "#PBS -l mem=8gb,nodes=1:ppn=4,walltime=01:00:00", the user is saying, "I need one node with four processors and 8 GB of RAM for one hour."  If no nodes in your cluster have that configuration (four cores && 8 GB RAM), that's why it's blocked.
 
There's no way this job can be scheduled in the "worst case is that 3 processes run on one node and the 4th on another" layout, because the user only requested one node.
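
A quick sanity check on the Torque side (off the top of my head, so double-check the exact field names for your Torque version) is to look at what each node actually advertises for cores and physical memory:

pbsnodes -a | egrep 'np = |physmem'

If the physmem figure comes back even slightly under 8192 MB, a mem=8gb request may never fit on a single node, no matter how idle the cluster is.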
 
--Joe

________________________________

From: mauiusers-bounces at supercluster.org on behalf of Jim Kusznir
Sent: Fri 10/3/2008 12:17 PM
To: Discussion of Rocks Clusters; mauiusers at supercluster.org
Subject: [Mauiusers] Why a torque/maui job won't start?



Hello:

As I was looking through the job queue on my cluster, I found myself
mystified: I have one job that just won't start, and I can't figure
out why:

[root at aeolus changhun]# qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
4428.aeolus         CMAQ.aug.benz    ramos           21:00:00 R default
4429.aeolus         CMAQ.dec.benz    ramos           32:31:14 R default
4437.aeolus         hsa_xml.sh       changhun               0 Q default
4442.aeolus         for.chem.ga2     sledburg        2095:20: R default
4483.aeolus         mem2Rjob2        wdavis          258:09:4 R default


Job 4437 caught my attention, as it appears it should have started
before 4442 and 4483, both of which want far more resources than it
does.  In addition, at this moment I have one node available, and each of
my nodes has 8 cores and 8 GB of RAM.  The user's job script reads:

[root at aeolus hsa_xml]# more hsa_xml.sh
#PBS -l mem=8gb,nodes=1:ppn=4,walltime=01:00:00
#PBS -m abe
#PBS -M <deleted>
# copy qsub's env to the job
#PBS -V

cd $PBS_O_WORKDIR
mpirun mpi_subdue -limit 100 hsa_xml.g

I'm still not entirely sure what the mem= flag is supposed to set, but
in any case, here's what checkjob says:

[root at aeolus hsa_xml]# checkjob 4437


checking job 4437

State: Idle
Creds:  user:changhun  group:changhun  class:default  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Oct  1 10:21:03
  (Time Queued  Total: 1:22:49:15  Eligible: 00:00:00)

Total Tasks: 4

Req[0]  TaskCount: 4  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

Holds:    Batch  (hold reason:  NoResources)
Messages:  cannot create reservation for job '4437' (intital
reservation attempt)

PE:  8.97  StartPriority:  540
cannot select job 4437 for partition DEFAULT (job hold active)

From this, it appears it's trying to schedule 4 processes, with each
process having 2 GB of RAM.  Worst case, 3 processes run on one node
and the 4th on another.  That much capacity has been available several
times since the job was queued.  Why won't this job run?
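
To dig further, I'll probably also look at what Maui itself thinks the nodes have configured and available (checknode and mdiag are Maui commands; "node01" below is just a placeholder for a real node name):

mdiag -n
checknode node01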

I suspect that if the user removes the mem= limit, it will run, but that
still leaves the question of why.
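
If the per-job mem= request does turn out to be the culprit, something like the following might express what the user wants more safely (untested; pmem is per-process rather than per-job memory if I'm reading the Torque docs right, and 1900mb is just a guess that leaves some headroom for the OS):

#PBS -l nodes=1:ppn=4,pmem=1900mb,walltime=01:00:00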

--Jim
_______________________________________________
mauiusers mailing list
mauiusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers

