[torqueusers] performance problem on x86_64

Wightman wightman at clusterresources.com
Thu Oct 6 15:39:17 MDT 2005


Although my own system is MUCH smaller, it is x86_64 (a mix of Debian and
Fedora).  I see no slowdown at all with any client commands.  (Our other
x86_64 systems include Fedora 4 and CentOS 4; no reports of slowdown there.)

Just FYI.

- Douglas

On Thu, 2005-10-06 at 13:35 -0700, Garrick Staples wrote:
> On Thu, Oct 06, 2005 at 10:43:43AM -0700, Garrick Staples alleged:
> > I'm getting plagued by a strange performance problem in x86_64 TORQUE.  It's
> > driving me nuts.
> > 
> > Multiple, quick stats of jobs or nodes are very slow when run on any x86_64
> > host.  The examples below work fine if I run them from any 32-bit host.  And it
> > seems to happen only when a lot of single-node jobs are in the queue (running or
> > idle).  (Dave, I think you've seen this happen on TeraGrid)
> 
> I've found that the problem is inside of pbs_iff, but I can't figure out
> why.  This is cleaned up slightly with the attached patch:
> 
> # ./pbs_iff -t hpc-pbs 15001; strace -r ./pbs_iff -t hpc-pbs 15001
> ...
>      0.000000 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
>      0.000000 fcntl(3, F_GETFL)         = 0x2 (flags O_RDWR)
>      0.000000 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>      0.000000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, "\1\0\0\0\0\0\0\0", 8) = 0
>      0.000000 bind(3, {sa_family=AF_INET, sin_port=htons(1023), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
>      0.000000 connect(3, {sa_family=AF_INET, sin_port=htons(15001), sin_addr=inet_addr("10.125.0.205")}, 16) = -1 EINPROGRESS (Operation now in progress)
>      0.000000 select(1024, NULL, [3], NULL, {5, 0}) = 1 (out [3], left {2, 0})
>      3.000000 getsockopt(3, SOL_SOCKET, SO_ERROR, [17179869184], [4]) = 0
> 
> 
> Note that the select() call takes 3 seconds!  Every time it fails, it is
> always precisely 3 seconds.
> 
> I also tried removing the O_NONBLOCK and SO_REUSEADDR bits, but that
> didn't affect it either.  I'm thinking this is a Linux (RHEL3) bug.
> 
> 
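For anyone who wants to reproduce the timing outside of pbs_iff, the
sequence in the strace output above (non-blocking connect(), select() for
writability, then getsockopt(SO_ERROR)) can be sketched roughly as below.
This is only a sketch and not TORQUE code: timed_connect() and its
parameters are hypothetical names made up for illustration. It skips the
bind() to reserved port 1023 (which needs root) and the SO_REUSEADDR
call, and it deliberately uses an int with a socklen_t length for
SO_ERROR, since the trace shows an 8-byte option value being passed to
setsockopt().

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of TORQUE): connect to ip:port using a
 * non-blocking socket and wait at most timeout_sec for completion.
 * Returns 0 on success, otherwise an errno value (ETIMEDOUT if select()
 * expired).  The wall-clock time spent is stored in *elapsed_sec.
 */
int timed_connect(const char *ip, unsigned short port,
                  int timeout_sec, double *elapsed_sec)
{
    struct sockaddr_in addr;
    struct timeval start, end, tv;
    fd_set wfds;
    int so_error = 0;                 /* SO_ERROR is an int, not a long */
    socklen_t len = sizeof(so_error);
    int fd, flags, rc, result;

    if (elapsed_sec)
        *elapsed_sec = 0.0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return EINVAL;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    /* Same fcntl() pair as in the trace: add O_NONBLOCK to the flags. */
    flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        result = errno;
        close(fd);
        return result;
    }

    gettimeofday(&start, NULL);
    rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc == 0) {
        result = 0;                   /* connected immediately */
    } else if (errno != EINPROGRESS) {
        result = errno;               /* hard failure */
    } else {
        /* Wait until the socket becomes writable or the timeout hits. */
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        rc = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (rc == 0)
            result = ETIMEDOUT;
        else if (rc < 0)
            result = errno;
        else if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) < 0)
            result = errno;
        else
            result = so_error;        /* 0 means the connect succeeded */
    }
    gettimeofday(&end, NULL);

    if (elapsed_sec)
        *elapsed_sec = (double)(end.tv_sec - start.tv_sec) +
                       (double)(end.tv_usec - start.tv_usec) / 1e6;
    close(fd);
    return result;
}
```

Calling timed_connect("10.125.0.205", 15001, 5, &t) in a loop from an
affected host and printing t should show whether the 3-second stalls
happen without any TORQUE code involved.  One possibility worth checking,
and it is only an assumption that the trace does not prove, is that 3
seconds matches the initial TCP SYN retransmission timeout used by Linux
kernels of this era, which would point at a dropped first SYN rather than
at select() itself.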