On Tue, Nov 16, 2004 at 08:59:55AM -0500, Chris Johnson alleged:
>      Hi people,
>      We have a strange problem that we saw in OpenPBS as well a few
> years ago.  This caused us to switch to PBSPro which seems to have
> solved the problem.  We were seeing this in OpenPBS on earlier RH
> versions.  We're currently seeing it with torque 1.0.1p6 in FC2.
>      Our nodes sometimes go into a comatose state in which they can
> be pinged but not ssh'ed or rsh'ed to at all.  The network seems up
> but many of the higher functions are unresponsive.  Torque, and
> earlier OpenPBS, don't respond well at all.  The master doesn't detect
> these nodes as being down.  The scheduler thinks they're still up and
> hangs on them unless they are specifically marked as offline.  This of
> course makes scheduling jobs a real pain because it doesn't happen
> very fast if at all. 
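
One way to catch this state from the master is to compare a ping against a
time-limited ssh, since a comatose node answers the first and hangs on the
second.  A rough sketch, assuming Linux ping (-W in seconds), coreutils
timeout, and key-based ssh from the master.  classify_node and check_node
are made-up names, not anything shipped with torque:

```shell
#!/bin/sh
# Hypothetical helpers, not part of torque or OpenPBS.

# classify_node PING_RC SSH_RC
# Prints "down" if ping failed, "comatose" if ping worked but ssh did not,
# and "up" if both worked.
classify_node() {
    if [ "$1" -ne 0 ]; then
        echo down
    elif [ "$2" -ne 0 ]; then
        echo comatose
    else
        echo up
    fi
}

# check_node HOST
# Probes HOST once with ping and once with ssh, then classifies it.
# Assumes Linux iputils ping, coreutils timeout, and passwordless ssh.
check_node() {
    cn_host=$1
    ping -c 1 -W 2 "$cn_host" >/dev/null 2>&1
    cn_ping_rc=$?
    # ConnectTimeout only bounds the TCP connect; timeout bounds the
    # whole session in case the node accepts the connection and then hangs.
    timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 "$cn_host" true \
        >/dev/null 2>&1
    cn_ssh_rc=$?
    classify_node "$cn_ping_rc" "$cn_ssh_rc"
}

# Usage:
#   [ "$(check_node hpc0799)" = comatose ] && pbsnodes -o hpc0799
```

Marking the node offline with pbsnodes -o is the same manual step described
above, just driven by the check.  The timeouts are guesses and would need
tuning per site.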

Yah, this is a fun one.  pbs_server sits around piling up TCP connections to
your hanging node too.

# lsof -p 28228 | grep hpc0799 | wc -l

A snippet of lsof...

pbs_serve 28228 root   21u  IPv4            2737954             TCP hpc-pbs.usc.edu:1003->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   22u  IPv4            2736870             TCP hpc-pbs.usc.edu:989->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   23u  IPv4            2737315             TCP hpc-pbs.usc.edu:988->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   24u  IPv4            2738508             TCP hpc-pbs.usc.edu:1002->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   25u  IPv4            2739153             TCP hpc-pbs.usc.edu:1000->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   26u  IPv4            2738514             TCP hpc-pbs.usc.edu:1001->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   27u  IPv4            2746471             TCP hpc-pbs.usc.edu:991->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
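
To see which nodes are collecting connections like this, the lsof output can
be tallied per node.  A small sketch, assuming lsof prints the remote port
as the service name pbs_mom (as in the snippet above, so no -P) and that the
last two fields are the address pair and the state.  count_mom_conns is a
made-up name:

```shell
#!/bin/sh
# Hypothetical helper: count ESTABLISHED connections to pbs_mom per node.
# Reads lsof-style lines on stdin; prints "count host", busiest first.
count_mom_conns() {
    awk '$NF == "(ESTABLISHED)" && $(NF-1) ~ /->.*:pbs_mom$/ {
        split($(NF-1), ends, "->")      # "server:port->node:pbs_mom"
        sub(/:pbs_mom$/, "", ends[2])   # keep just the node name
        n[ends[2]]++
    }
    END { for (h in n) print n[h], h }' | sort -rn
}

# Usage (28228 is the pbs_server PID from the listing above):
#   lsof -a -p 28228 -i TCP | count_mom_conns
```

A node sitting near the top with a count that keeps growing is a likely
candidate for the hung state.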

