[torqueusers] pbs_mom endless kill loop
Kevin Murphy
murphy at genome.chop.edu
Wed Oct 22 09:17:28 MDT 2008
On Oct 21, 2008, at 2:34 PM, George Wm Turner wrote:
> I'll add a "me too!"
>
> I've seen it with versions up to torque 2.3.4. Later versions (2.3.3,
> 2.3.4) are better, i.e. not as likely to tip over into this mode;
> 2.3.2 was very bad about getting into this state.
>
> I suspect with each iteration of the loop it opens another socket
> back to the pbs_server; I quickly run out of privileged ports and
> then NFS goes offline.
>
OK, just for the record, we've been having what I am now pretty sure
is a NIC driver problem, which sometimes causes kernel panics on the
head node. After the head node is restarted (and not before), the
moms invariably exhibit this problem.
-Kevin