[torqueusers] pbs_mom endless kill loop
murphy at genome.chop.edu
Wed Oct 22 09:17:28 MDT 2008
On Oct 21, 2008, at 2:34 PM, George Wm Turner wrote:
> I'll add a "me too!"
> I've seen it with versions up to torque 2.3.4; later versions
> (2.3.3, 2.3.4) are better, i.e. not as likely to tip over into this
> mode. 2.3.2 was very bad about getting into this state.
> I suspect with each iteration of the loop it opens another socket
> back to the pbs_server; I quickly run out of privileged ports and
> then NFS goes offline.
OK, just for the record, we've been having what I am now pretty sure
is a NIC driver problem, which sometimes causes kernel panics on the
head node. After the head node is restarted (and not before), the
moms invariably exhibit this problem.
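
For anyone who wants to check the privileged-port theory quoted above on an
affected node, here is a rough sketch, not anything taken from TORQUE itself.
It assumes a Linux node where /proc/net/tcp is readable, counts only IPv4
sockets, and treats 15001 as the pbs_server port (an assumption; adjust it to
your site's setting). The function name is made up for illustration.

```python
import os


def count_privileged_ports(proc_net_tcp_text, remote_port=None):
    """Count distinct local ports below 1024 listed in /proc/net/tcp text.

    If remote_port is given, only sockets whose remote port matches are
    counted (e.g. connections back to the server).
    """
    ports = set()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with an index such as "0:"; the header row does not.
        if len(fields) < 3 or not fields[0].endswith(":"):
            continue
        # Addresses are written as HEXADDR:HEXPORT.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        rem_port = int(fields[2].rsplit(":", 1)[1], 16)
        if local_port >= 1024:
            continue
        if remote_port is not None and rem_port != remote_port:
            continue
        ports.add(local_port)
    return len(ports)


if __name__ == "__main__":
    path = "/proc/net/tcp"  # IPv4 only; /proc/net/tcp6 is not read here
    if os.path.exists(path):
        with open(path) as fh:
            text = fh.read()
        print("privileged local ports in use:", count_privileged_ports(text))
        # 15001 is assumed to be the pbs_server port on this cluster.
        print("of those, talking to port 15001:",
              count_privileged_ports(text, remote_port=15001))
```

Running it a few times while a mom is stuck in the kill loop should show
whether the second number keeps climbing, which is what the socket-leak
explanation would predict.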