[torqueusers] Cluster has enough CPUs but job refuses to start
Gus Correa
gus at ldeo.columbia.edu
Tue Mar 15 17:45:39 MDT 2011
Jeremy Mann wrote:
> I recently added 4 additional nodes to our cluster and now queued jobs
> refuse to start and say the requested number of procs in partition DEFAULT
> has been exceeded.
>
> 3 Active Jobs 96 of 144 Processors Active (66.67%)
> 12 of 20 Nodes Active (60.00%)
>
> Total Jobs: 16 Active Jobs: 3 Idle Jobs: 0 Blocked Jobs: 13
>
> The 13 "blocked" jobs are requesting a mix of 32 and 16 CPUs. Obviously we
> have enough CPUs left (96 out of 144 used), so I do not have a clue why
> these 13 jobs remain blocked. I tried using qalter to lower the amount
> requested, for example:
>
> qalter -l nodes=2:ppn=4 4832
>
> But jobid 4832 still says:
>
> Holds: Defer
> Messages: exceeds available partition procs
> PE: 8.00 StartPriority: 119
> cannot select job 4832 for partition DEFAULT (job hold active)
>
>
Hi Jeremy
Did you update your Torque nodes file to include the new nodes?
($TORQUE/server_priv/nodes)
Did you restart the Torque server after that?
(service pbs_server restart)
Do you use the Maui scheduler?
If so, did you restart it as well? (service maui restart)
Is pbs_mom running on the new nodes? ('pbsnodes' should tell you)
Well, maybe you already did these things.
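In case it helps with the first check, here is a minimal sketch that adds up
the np= values in the nodes file, so you can compare the total with the 144
processors the scheduler reports. The function name count_procs is made up
for this example. It assumes the usual one-node-per-line layout
(e.g. "node01 np=8") and assumes a node listed without np= counts as one
processor.

```shell
# count_procs FILE
# Hypothetical helper: sum the np= values in a Torque nodes file.
# Assumption: a node line without np= contributes 1 processor.
count_procs() {
    awk 'NF == 0 || $1 ~ /^#/ { next }     # skip blank lines and comments
         {
             n = 1                          # assumed default when np= is absent
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^np=[0-9]+$/) { split($i, kv, "="); n = kv[2] }
             total += n
         }
         END { print total + 0 }' "$1"
}

# Example use on the server host ($TORQUE as in the path above):
#   count_procs "$TORQUE/server_priv/nodes"
#   pbsnodes -a     # then check the state reported for each new node
```

If the total is lower than expected, the new nodes are probably missing from
the file, or listed with the wrong np.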
I hope this helps,
Gus Correa