[torqueusers] Cluster has enough CPUs but job refuses to start

Gus Correa gus at ldeo.columbia.edu
Tue Mar 15 17:45:39 MDT 2011


Jeremy Mann wrote:
> I recently added 4 additional nodes to our cluster and now queued jobs
> refuse to start and say the requested number of procs in partition DEFAULT
> has been exceeded.
> 
>  3 Active Jobs      96 of  144 Processors Active (66.67%)
>                         12 of   20 Nodes Active      (60.00%)
> 
> Total Jobs: 16   Active Jobs: 3   Idle Jobs: 0   Blocked Jobs: 13
> 
> The 13 "blocked" jobs are requesting a mix of 32 and 16 CPUs. Obviously we
> have enough CPUs left (96 out of 144 used), so I do not have a clue why
> these 13 jobs remain blocked. I tried using qalter to lower the amount
> requested, for example:
> 
> qalter -l nodes=2:ppn=4 4832
> 
> But jobid 4832 still says:
> 
> Holds:    Defer
> Messages:  exceeds available partition procs
> PE:  8.00  StartPriority:  119
> cannot select job 4832 for partition DEFAULT (job hold active)
> 
> 
Hi Jeremy

Did you update your Torque nodes file to include the new nodes? 
($TORQUE/server_priv/nodes)
Did you restart the Torque server after that?
(service pbs_server restart)
Do you use the Maui scheduler?
If so, did you restart it? (service maui restart)
Is pbs_mom running on the new nodes?  ('pbsnodes' should tell)
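
The first and last checks above can be scripted roughly as follows. This is only a sketch: the count_procs helper is hypothetical, and the nodes file location is an assumption ($TORQUE_HOME, falling back to /var/spool/torque, which is a common default but may differ on your install). It sums the np= values in the nodes file so the total can be compared with the processor count the scheduler reports, then asks pbsnodes for any nodes that are not up.

```shell
#!/bin/sh
# Hypothetical helper: sum the np= values in a Torque nodes file.
# Entries look like "node01 np=8"; an entry with no np= attribute
# is counted as 1 processor (assumed to match Torque's default).
count_procs() {
    awk 'NF == 0 || $1 ~ /^#/ { next }    # skip blank lines and comments
         {
             n = 1
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^np=[0-9]+$/) n = substr($i, 4) + 0
             total += n
         }
         END { print total + 0 }' "$1"
}

# Assumed location of the nodes file; adjust for your installation.
NODES_FILE="${TORQUE_HOME:-/var/spool/torque}/server_priv/nodes"

if [ -r "$NODES_FILE" ]; then
    echo "Processors listed in nodes file: $(count_procs "$NODES_FILE")"
fi

# With -l and no state argument, pbsnodes lists only nodes that are
# down, offline, or unknown. Empty output means every node is reporting.
if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes -l
fi
```

If the total printed is lower than the processor count you expect, the new nodes are probably missing from the nodes file, or the server has not been restarted since the file was edited.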

Well, maybe you already did these things.

I hope this helps,
Gus Correa


