[torqueusers] jobs stuck

bill cluster.bill at alinto.com
Mon Sep 18 02:46:56 MDT 2006


I came back to work on Monday and found every job stuck.
CPU usage is at 0%.
show_pbs_res.py shows:
Total nodes : 2
	Nodes with 4 CPU
		1 node with -8 CPU free
		1 node with 0 CPU free

I had 6 jobs running (sort of), each requesting 2 CPUs!
All of the jobs were placed on the same node.
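The "-8 CPU free" figure matches simple arithmetic if free CPUs are computed as the node's configured CPUs minus the CPUs requested by the jobs placed on it. This is an assumption, because the logic inside show_pbs_res.py is not shown here. Under that assumption, 4 - 6 × 2 = -8, which means the node was oversubscribed threefold. The minimal Python sketch below reproduces only that calculation for the overloaded node. The node name `node01` and the `Job`/`free_cpus` names are hypothetical, and the sketch does not account for the second node's 0 free CPUs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    job_id: str  # hypothetical job identifier
    node: str    # node the job was placed on
    cpus: int    # CPUs the job requested


def free_cpus(total_per_node: dict[str, int], jobs: list[Job]) -> dict[str, int]:
    """Assumed formula: configured CPUs minus CPUs requested by jobs on that node.

    A negative value means more CPUs were handed out than the node has.
    """
    free = dict(total_per_node)
    for job in jobs:
        free[job.node] -= job.cpus
    return free


if __name__ == "__main__":
    # One 4-CPU node (hypothetical name) with six 2-CPU jobs placed on it.
    nodes = {"node01": 4}
    jobs = [Job(f"job{i}", "node01", 2) for i in range(6)]
    print(free_cpus(nodes, jobs))  # prints {'node01': -8}
```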

Where should I look to understand what happened this weekend?

I deleted these 6 jobs with qdel.

One job started (requesting 6 CPUs).
Many more jobs requesting 2 CPUs are queued, but they don't start.
qstat -f on these jobs shows:
comment = Not Running: Draining system to allow starving job to run
What happened to my cluster?

Where should I start looking?

