[torqueusers] Torque/Maui scheduling to dead nodes?

Daniel.G.Roberts at sanofi-aventis.com Daniel.G.Roberts at sanofi-aventis.com
Tue Oct 2 09:35:15 MDT 2007


Hello all,
We are running the following versions of Torque/Maui:
Maui version maui-3.2.6p16-snap.1157560841
 /opt/sched/commands/sbin/pbs_server --version
version: 2.1.6

My question is this. Our cluster is quite busy and has about 100 nodes.
 
Every once in a great while the system goes haywire in the following
way:
 
We might have hundreds of jobs running without any problems, and then at
some point 10 nodes or so become available.
Let's say it is nodes 1-10 that could be used.
What happens next is this: the queued jobs are all scheduled
against node1 and fly through the system, and nothing is ever scheduled
against the remaining nodes 2-10.
When the user calls to say all his jobs have failed and we try to
figure out what has happened, we realize at this point that:
 
Node1 answers ping, but we can't rsh into it to see
what is going on. When we go to the console of node1, we see that it
has perhaps suffered a disk crash and is still limping along in a
weird state.
BUT from the scheduler's point of view, pbsnodes -a reports node1 as
free and it is pingable from the head node. Because of that pbsnodes
status, all the queued jobs get delivered to node1 and
disappear.
How do we get around this problem? Has this particular issue been
addressed in newer versions of Maui/Torque?
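
To make the failure mode concrete, here is a minimal sketch of a stopgap check
that could be run from the head node. It is not a Torque or Maui feature, only
an illustration. It assumes rsh access to the compute nodes (as described
above), the GNU coreutils "timeout" command on the head node, and a writable
/tmp on each node. The function names and the 10-second limit are
hypothetical. A node that cannot run a trivial command touching its local disk
is marked offline with "pbsnodes -o", so the scheduler stops sending jobs to it.

```shell
#!/bin/sh
# Sketch only: mark nodes offline when they answer ping but cannot run a
# trivial remote command. Node names are passed as arguments.

PROBE_TIMEOUT=${PROBE_TIMEOUT:-10}   # seconds; an arbitrary choice

# Run a small command on the node that touches its local disk.
# rsh does not reliably pass back the remote exit status, so success is
# judged by the text that comes back rather than by $?.
probe_node() {
    timeout "$PROBE_TIMEOUT" rsh "$1" \
        'touch /tmp/.probe.$$ && rm -f /tmp/.probe.$$ && echo OK' 2>/dev/null
}

# Take the node out of scheduling; "pbsnodes -o" sets the offline state.
mark_offline() {
    pbsnodes -o "$1"
}

# Prints "ok <node>" or "offline <node>"; returns non-zero if marked offline.
check_node() {
    if [ "$(probe_node "$1")" = "OK" ]; then
        echo "ok $1"
    else
        mark_offline "$1"
        echo "offline $1"
        return 1
    fi
}

for node in "$@"; do
    check_node "$node"
done
```

Something like this could be run periodically from cron with the compute node
names as arguments. Once a node has been repaired, "pbsnodes -c" clears the
offline state again.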
Thanks for any advice
Dan