[torqueusers] possible protocol problem.
Garrick Staples
garrick at usc.edu
Tue Nov 16 10:27:00 MST 2004
On Tue, Nov 16, 2004 at 08:59:55AM -0500, Chris Johnson alleged:
> Hi people,
>
> We have a strange problem that we also saw in OpenPBS a few
> years ago. This caused us to switch to PBSPro, which seems to have
> solved the problem. We were seeing this in OpenPBS on earlier RH
> versions. We're currently seeing it with torque 1.0.1p6 on FC2.
>
> Our nodes sometimes go into a comatose state in which they can
> be pinged but not ssh'ed or rsh'ed to at all. The network seems up
> but many of the higher functions are unresponsive. Torque, and
> earlier OpenPBS, don't respond well at all. The master doesn't detect
> these nodes as being down. The scheduler thinks they're still up and
> hangs on them unless they are specifically marked as offline. This of
> course makes scheduling jobs a real pain because it doesn't happen
> very fast if at all.
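For what it's worth, the "pings but won't ssh" state can be told apart from a node that is really down with a small TCP probe. What follows is only a rough sketch, not anything Torque does itself: probe_node and its return strings are made-up names, and it assumes you point it at a service that sends a banner as soon as you connect (sshd on port 22 does; pbs_mom is not assumed to). A hung node typically still completes the TCP handshake in the kernel, but the banner never arrives.

```python
import socket

def probe_node(host, port, timeout=5.0):
    """Classify a node by whether a TCP service on it actually talks back.

    Hypothetical helper for illustration. Returns one of:
      "unreachable" - connect failed or timed out, or the peer closed
                      the connection without sending anything
      "hung"        - connect succeeded but no banner arrived in time
      "alive"       - the service sent at least one byte
    """
    try:
        # The kernel on the remote side completes this handshake even
        # when the user-space daemon is wedged.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            # Wait for the first bytes of the service banner.
            data = sock.recv(64)
        except socket.timeout:
            return "hung"
        except OSError:
            return "unreachable"
    return "alive" if data else "unreachable"
```

Something like probe_node("hpc0799.usc.edu", 22) run from cron could then feed whatever you use to mark nodes offline. The timeout is a guess and would need tuning for a loaded cluster.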
Yah, this is a fun one. pbs_server sits around piling up TCP connections to
your hanging node too.
# lsof -p 28228 | grep hpc0799 | wc -l
30
A snippet of the lsof output:
pbs_serve 28228 root 21u IPv4 2737954 TCP hpc-pbs.usc.edu:1003->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root 22u IPv4 2736870 TCP hpc-pbs.usc.edu:989->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root 23u IPv4 2737315 TCP hpc-pbs.usc.edu:988->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root 24u IPv4 2738508 TCP hpc-pbs.usc.edu:1002->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root 25u IPv4 2739153 TCP hpc-pbs.usc.edu:1000->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root 26u IPv4 2738514 TCP hpc-pbs.usc.edu:1001->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root 27u IPv4 2746471 TCP hpc-pbs.usc.edu:991->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
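If you want to watch for this pile-up across all nodes rather than grepping for one, the lsof output is easy to tally per remote host. A minimal sketch, assuming the NAME column looks like the lines above (local:port->remote:port followed by the state in parentheses); count_established is a made-up name, not part of any PBS tool:

```python
import re
from collections import Counter

# Pulls the remote host out of the NAME column of an ESTABLISHED row, e.g.
#   hpc-pbs.usc.edu:1003->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
_REMOTE = re.compile(r"->(\S+):\S+\s+\(ESTABLISHED\)")

def count_established(lsof_text):
    """Return a Counter mapping remote host -> number of ESTABLISHED
    TCP connections found in the given lsof output text."""
    counts = Counter()
    for line in lsof_text.splitlines():
        match = _REMOTE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feed it the text of "lsof -p <pid of pbs_server>" and look at counts.most_common(); a node holding far more connections than its neighbours is a good candidate for the hang described here.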
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California