[Mauiusers] Reservation help

Charles Johnson charles.johnson at accre.vanderbilt.edu
Tue Oct 28 10:14:50 MDT 2008


I apologize for the cross-post, but I am in need of help.

I am using Moab server version 5.1.0p9 (snap NA) (rev. 8933) and
Torque 2.3.0.

Here is what has happened. We have a group using the cluster that needs
16GB RAM. We added RAM to a compute node and brought the node back up.
Torque sees the node, and pbsnodes reports the RAM at 16GB. This group was
happy to submit a job and know that in about 24 hours the job would run on
the node. To facilitate that, I created a rolling reservation with a lead
of 24 hours. From reading the documentation, I added the following lines
to moab.cfg and recycled the scheduler:

#Rolling reservation for rokaslabs
SRCFG[rokaslab] HOSTLIST=vmp001 ACCOUNTLIST=rokaslab GROUPLIST=*rokaslab
SRCFG[rokaslab] FLAGS=BYNAME
SRCFG[rokaslab] ROLLBACKOFFSET=24:00:00
SRCFG[rokaslab] PERIOD=INFINITY

Moab reports that the reservation was created, and showres shows
rokaslab.4732.

As a test, I had the user add the following lines to his PBS script:

#PBS -l mem=10000mb
#PBS -l advres:rokaslab.4732

checkjob reported that reservation rokaslab.4732 could not be found. Right
enough, showres showed rokaslab.4733.

So, I had the user resubmit with

#PBS -l advres:rokaslab.4733

and checkjob reports that rokaslab.4733 could not be found. showres showed
rokaslab.4734.

I had the user resubmit this way:

#PBS -l advres:rokaslab

Now the job in question appears in the idle queue. It shows that the job
has acquired a reservation for vmp001; however, checkjob gives the
following output:

checkjob 552072
job 552072

AName: pbs_velvet_21_2runs_Agambiae_10gigs_2
State: Idle
Creds:  user:gibbonjg  group:rokaslab  account:rokaslab  class:all
WallTime:   00:00:00 of 4:00:00
SubmitTime: Mon Oct 27 10:44:10
   (Time Queued  Total: 1:00:08:26  Eligible: 1:00:08:09)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Memory >= 5000M  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: x86
Dedicated Resources Per Task: PROCS: 1  MEM: 10000M

Reserved Nodes:  (23:59:43 -> 1:03:59:43  Duration: 4:00:00)
[vmp001:1]


BypassCount:    293
Partition Mask: [base]
Flags:          ADVRES:rokaslab,RESTARTABLE
Attr:           checkpoint
StartPriority:  2680
EstimatedStart: H: -86786  R: -518  B: 54214
PE:             11.01
Reservation '552072' (23:59:43 -> 1:03:59:43  Duration: 4:00:00)
rejected for Features     -
rejected for CPU          -
rejected for Memory       -
rejected for State        -
NOTE:  job cannot run in partition base (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 420  feasible procs:   0

Node Rejection Summary: [Features: 214][CPU: 1][Memory: 121][State: 288]
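
As a sanity check on the times in this output, here is a minimal Python
sketch that converts the colon-separated times into seconds. The helper
name moab_time_to_seconds is hypothetical (it is not a Moab tool), and the
[DD:]HH:MM:SS layout is an assumption inferred from the output above. Under
that assumption, the reserved window is internally consistent (start plus
Duration equals the end), and its start sits just short of the 24:00:00
ROLLBACKOFFSET.

```python
def moab_time_to_seconds(text: str) -> int:
    """Convert a '[DD:]HH:MM:SS' string to seconds.

    ASSUMPTION: the field layout is inferred from the checkjob output in
    this post, not taken from the Moab documentation.
    """
    parts = [int(p) for p in text.strip().split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"unexpected time format: {text!r}")
    # Pad on the left so we always have days, hours, minutes, seconds.
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Values copied from the "Reserved Nodes" line and from moab.cfg.
start = moab_time_to_seconds("23:59:43")
end = moab_time_to_seconds("1:03:59:43")
duration = moab_time_to_seconds("4:00:00")
rollback_offset = moab_time_to_seconds("24:00:00")

print(end - start == duration)   # True: the window is exactly 4 hours long
print(rollback_offset - start)   # 17: the start is 17 s short of 24 hours
```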

So, vmp001 is acquired, but I get the "NOTE: job cannot run in partition
base (idle procs do not meet requirements : 0 of 1 procs found)" message.
Moreover, showstart reports that the job will start in 24 hours. After
waiting an hour, showstart again reports that the job will start in 24
hours.
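
To illustrate the symptom, here is a small Python sketch. It assumes,
without confirmation, that the rolling reservation's start is always
recomputed as the current time plus ROLLBACKOFFSET. The function names are
hypothetical and this is not how Moab is necessarily implemented. If the
assumption holds, the reported wait never shrinks, which would match what
showstart shows.

```python
from datetime import datetime, timedelta

# ROLLBACKOFFSET=24:00:00 from the moab.cfg lines in this post.
ROLLBACK_OFFSET = timedelta(hours=24)


def earliest_reservation_start(now: datetime) -> datetime:
    # ASSUMPTION: the reservation always begins ROLLBACK_OFFSET after "now".
    return now + ROLLBACK_OFFSET


def reported_wait(now: datetime) -> timedelta:
    # Time a job bound to the reservation would still have to wait.
    return earliest_reservation_start(now) - now


# Arbitrary illustrative timestamp, not taken from any log.
t0 = datetime(2008, 10, 28, 10, 0, 0)
for hours_later in (0, 1, 25):
    now = t0 + timedelta(hours=hours_later)
    # The wait is the same at every check, no matter how long we wait.
    print(hours_later, reported_wait(now))
```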

Clearly, I am missing something, or the reservation is misconfigured, or I
haven't given Torque sufficient information to report the available RAM,
or all of these.

I would really appreciate some help; the users are beginning to get
impatient.

TIA

Charles
---
Charles Johnson
Advanced Computing Center for Research and Education
Vanderbilt University
charles.johnson at accre.vanderbilt.edu
Office: 615-343-2776
Cell: 615-478-8799





