[torqueusers] possible protocol problem.

Garrick Staples garrick at usc.edu
Tue Nov 16 10:27:00 MST 2004


On Tue, Nov 16, 2004 at 08:59:55AM -0500, Chris Johnson alleged:
>      Hi people,
> 
>      We have a strange problem that we saw in OpenPBS as well a few
> years ago.  This caused us to switch to PBSPro which seems to have
> solved the problem.  We were seeing this in Open on earlier RH
> versions.  We're currently seeing it with torque 1.0.1p6 in FC2.
> 
>      Our nodes sometimes go into a comatose state in which they can
> be pinged but not ssh'ed or rsh'ed to at all.  The network seems up
> but many of the higher functions are unresponsive.  Torque, and
> earlier OpenPBS, don't respond well at all.  The master doesn't detect
> these nodes as being down.  The scheduler thinks they're still up and
> hangs on them unless they are specifically marked as offline.  This of
> course makes scheduling jobs a real pain because it doesn't happen
> very fast if at all. 

Yah, this is a fun one.  pbs_server sits around piling up TCP connections to
your hanging node too.

# lsof -p 28228 | grep hpc0799 | wc -l
     30

A snippet of lsof...

pbs_serve 28228 root   21u  IPv4            2737954             TCP hpc-pbs.usc.edu:1003->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   22u  IPv4            2736870             TCP hpc-pbs.usc.edu:989->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   23u  IPv4            2737315             TCP hpc-pbs.usc.edu:988->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   24u  IPv4            2738508             TCP hpc-pbs.usc.edu:1002->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   25u  IPv4            2739153             TCP hpc-pbs.usc.edu:1000->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   26u  IPv4            2738514             TCP hpc-pbs.usc.edu:1001->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
pbs_serve 28228 root   27u  IPv4            2746471             TCP hpc-pbs.usc.edu:991->hpc0799.usc.edu:pbs_mom (ESTABLISHED)
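
To see which nodes are collecting connections without grepping for one
hostname at a time, the same lsof output can be grouped by remote host.
This is only a rough sketch: count_conns_by_peer is a made-up helper name,
and it assumes lsof prints the connection as "local->remote:port" in the
second-to-last field with the state in parentheses as the last field, as
in the snippet above.

```shell
# Hypothetical helper: reads lsof output on stdin and prints
# "<count> <remote host>" for every ESTABLISHED TCP connection,
# busiest peer first.
count_conns_by_peer() {
  awk '
    /TCP/ && /ESTABLISHED/ {
      # Second-to-last field looks like local:port->remote:port
      split($(NF-1), ends, "->")
      peer = ends[2]
      sub(/:[^:]*$/, "", peer)   # drop the trailing :port or :service
      count[peer]++
    }
    END {
      for (host in count) print count[host], host
    }
  ' | sort -rn
}
```

Usage would be something like "lsof -p 28228 | count_conns_by_peer", where
28228 is the pbs_server PID from the example above. A node with a count far
above the others is a likely candidate for the hung state described here.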

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California