[torqueusers] Mother Superior Hangs when Sister Dies (4.2.2)
Joseph M Paris
j-paris at northwestern.edu
Mon Apr 22 21:16:19 MDT 2013
Dear All -
We have a situation where a multinode job was submitted in Moab and later cancelled by the user.
For the sake of argument, let's say the allocated nodes were: [qnode0441:8][qnode0245:8][qnode0250:8]....
We found that the mother superior (qnode0441) was unresponsive, resulting in repeated 5-minute (300-second TCP) timeouts for Moab, which was trying to start a job using qnode0441. The same behavior is observed when running any momctl command (locally or remotely). For example, running momctl -q loadave would hang for 5 minutes, report an error about not being able to run the command, and then retry (up to 5 times, I believe).
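One way to detect a hung mom without waiting out the full 5-minute timeout is to run the query under a shorter, caller-enforced deadline. The following is only a minimal sketch: the helper name probe and the 10-second default are illustrative assumptions, not part of TORQUE, and the only TORQUE command it relies on is the momctl -q loadave query mentioned above.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the outcome as 'ok', 'error', or 'timeout'.

    The child process is killed if it runs longer than timeout_s, so the
    caller never blocks for the full 300-second TCP timeout.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time: treat the mom as hung.
        return "timeout"
    except OSError:
        # The command could not be started at all (e.g. not installed).
        return "error"
    return "ok" if result.returncode == 0 else "error"
```

On the mother superior, probe(["momctl", "-q", "loadave"]) would be expected to return "timeout" while the mom is hung, which a monitoring script could use to flag the node before Moab repeatedly stalls on it.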
We tried restarting trqauthd and pbs_mom on the mother superior. We even tried a network restart followed by trqauthd and pbs_mom restarts. None of these got the mom on this node to respond. We observed through the pbs_mom logs that qnode0441 was apparently fixated on maintaining communication with qnode0245, which we found was no longer on the network. We rebooted qnode0245. After the node came back on the network and its mom started, qnode0441 stopped obsessing about communications and the mom became responsive again.
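Since the hang cleared only once the dead sister was back on the network, it may help to check which nodes of a stuck job are reachable before restarting anything. The sketch below parses an allocation string in the format shown earlier and attempts a TCP connection to each node. The function names are hypothetical, and port 15002 is an assumption (the usual pbs_mom service port); adjust it if the site configuration differs.

```python
import re
import socket


def parse_nodes(alloc):
    """Extract (hostname, cores) pairs from '[qnode0441:8][qnode0245:8]'."""
    pairs = re.findall(r"\[([^\[\]:]+):(\d+)\]", alloc)
    return [(host, int(cores)) for host, cores in pairs]


def reachable(host, port=15002, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds in time.

    port=15002 is assumed to be the pbs_mom service port on this cluster.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For the example allocation, parse_nodes("[qnode0441:8][qnode0245:8][qnode0250:8]") yields the mother superior first and then the sisters; calling reachable(host) on each sister would single out a node like qnode0245 that has dropped off the network.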
Several permutations of this behavior have been witnessed across our 800-node cluster.
Previously we were running a 2.5.x version of torque, and when a sister died we didn't see these hangs in Moab. If this didn't result in hangs in the scheduler it wouldn't be a big deal. And I'm not really convinced this is a Moab issue, because it's the mother superior that appears to hang when it loses a sister.
Are there any thoughts here? We've tried adjusting timeouts, node checks, reservation depth, etc. We're at a loss. It just seems that mother superior is having a hard time letting go (pun sort of
Associate Director for Research Computing
Northwestern University Information Technology (NUIT)
1800 Sherman Suite 206
Evanston, IL 60208