[torqueusers] Cluster has enough CPUs but job refuses to start
gus at ldeo.columbia.edu
Tue Mar 15 17:45:39 MDT 2011
Jeremy Mann wrote:
> I recently added 4 additional nodes to our cluster and now queued jobs
> refuse to start and say the requested number of procs in partition DEFAULT
> has been exceeded.
> 3 Active Jobs 96 of 144 Processors Active (66.67%)
> 12 of 20 Nodes Active (60.00%)
> Total Jobs: 16 Active Jobs: 3 Idle Jobs: 0 Blocked Jobs: 13
> The 13 "blocked" jobs are requesting a mix of 32 and 16 CPUs. Obviously we
> have enough CPUs left (96 out of 144 used), so I do not have a clue why
> these 13 jobs remain blocked. I tried using qalter to lower the number of
> processors requested, for example:
> qalter -l nodes=2:ppn=4 4832
> But jobid 4832 still says:
> Holds: Defer
> Messages: exceeds available partition procs
> PE: 8.00 StartPriority: 119
> cannot select job 4832 for partition DEFAULT (job hold active)
Did you update your Torque nodes file to include the new nodes?
Did you restart the Torque server after that?
(service pbs_server restart)
Do you use the Maui scheduler?
If so, did you restart it as well? (service maui restart)
Is pbs_mom running on the new nodes? ('pbsnodes' should tell you)
Well, maybe you already did these things.
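One quick way to check the first point is to add up the processors
declared in the nodes file and compare the sum with the total the
scheduler reports (144 in the output quoted above). This is only a
sketch: the path /var/spool/torque/server_priv/nodes is an assumed
default that varies between installations, count_np is a made-up helper
name, and it assumes one node per line with an optional np=N field
(a node without np= is counted as 1 processor).

```shell
#!/bin/sh
# Sum the np= values in a Torque nodes file.
# Expected line format (assumption): "nodename [np=N] [properties...]"
# count_np is a hypothetical helper, not part of Torque.
count_np() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }    # skip blank lines and comments
        {
            n = 1                        # no np= field: count 1 processor
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) { split($i, kv, "="); n = kv[2] }
            total += n
        }
        END { print total + 0 }
    ' "$1"
}

# Assumed default location; override with NODES_FILE=/path/to/nodes
NODES_FILE="${NODES_FILE:-/var/spool/torque/server_priv/nodes}"

if [ -r "$NODES_FILE" ]; then
    echo "processors declared in $NODES_FILE: $(count_np "$NODES_FILE")"
fi
```

If the sum does not include the new nodes, fix the file and restart
pbs_server. If it does, 'pbsnodes -l' lists the nodes the server
currently sees as down or offline.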
I hope this helps,