[torqueusers] pbs_mom endless kill loop

Kevin Murphy murphy at genome.chop.edu
Wed Oct 22 09:17:28 MDT 2008


On Oct 21, 2008, at 2:34 PM, George Wm Turner wrote:
> I'll add a "me too!"
>
> I've seen it with versions up to torque 2.3.4; the later versions
> (2.3.3, 2.3.4) are better, i.e. not as likely to tip over into this
> mode. 2.3.2 was very bad about getting into this state.
>
> I suspect with each iteration of the loop it opens another socket  
> back to the pbs_server; I quickly run out of privileged ports and  
> then NFS goes offline.
>

OK, just for the record, we've been having what I am now pretty sure  
is a NIC driver problem, which sometimes causes kernel panics on the  
head node.  After the head node is restarted (and not before), the  
moms invariably exhibit this problem.

-Kevin


