[torqueusers] Slow response of torque when jobs are running
Luc.Vereecken at chem.kuleuven.be
Tue Dec 8 04:29:53 MST 2009
I checked name resolution and the other network-related things I could
think of, and all of it seems to work.
E.g. "getent hosts `cat server_name`" resolves instantaneously on the
server and the nodes, as do all the hosts listed in the nodes file. All
cluster nodes are in the /etc/hosts file, which is identical on all
machines and kept up to date with cfengine.
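
The resolution check described above can be repeated for every host in one
go. What follows is only a minimal sketch under some assumptions: the helper
names (parse_nodes, time_lookup, slow_hosts) are made up for illustration,
the file paths are passed on the command line rather than guessed, the first
field of each line is taken to be the host name, and "#" is assumed to start
a comment. It times socket.getaddrinfo, which goes through the system
resolver but is not literally the same code path as "getent hosts".

```python
import socket
import sys
import time


def parse_nodes(text):
    """Return the first field of every non-empty, non-comment line.

    Assumption: "#" starts a comment and the host name is the first field.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])
    return hosts


def time_lookup(host, resolver=socket.getaddrinfo):
    """Time a single lookup; return (elapsed_seconds, succeeded)."""
    start = time.perf_counter()
    try:
        resolver(host, None)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return time.perf_counter() - start, ok


def slow_hosts(hosts, threshold=0.1, resolver=socket.getaddrinfo):
    """Return (host, seconds, ok) for lookups that fail or exceed threshold."""
    flagged = []
    for host in hosts:
        elapsed, ok = time_lookup(host, resolver)
        if not ok or elapsed > threshold:
            flagged.append((host, elapsed, ok))
    return flagged


if __name__ == "__main__":
    # Usage (paths are whatever your installation uses):
    #   python check_lookup.py <server_name file> <nodes file>
    names = []
    for path in sys.argv[1:]:
        with open(path) as handle:
            names.extend(parse_nodes(handle.read()))
    for host, elapsed, ok in slow_hosts(names):
        status = "ok" if ok else "FAILED"
        print(f"{host}: {elapsed:.3f}s {status}")
```

If this prints nothing on the server and on a node, name resolution is
probably not where the time goes, at least not at the 0.1 s threshold
chosen here.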
At 04:56 08/12/2009, Josh Bernstein wrote:
>I've gotta believe this is a name resolution issue.
>Can you check to make sure TORQUE's server_name file contains a
>hostname that resolves quickly with getent?
>On Dec 7, 2009, at 7:15 PM, "Garrick Staples" <garrick at usc.edu> wrote:
> > On Tue, Dec 08, 2009 at 01:39:38AM +0000, Luc Vereecken alleged:
> >> Hi Chris,
> >> I attach a strace -T output of qstat. The output looked like a normal
> >> qstat output with job numbers and running times etc., so nothing
> >> special there.
> >> The strace reveals that it all goes awry when accessing the
> >> /tmp/.torque-unix. Major time is lost on a poll (line 78) and a read
> >> (line 90), all other times look like normal timings.
> >> That reminds me that there is something like a no-unix-sockets option
> >> in configure, iirc.
> > What you want is an strace of the _server_ while doing a qstat.
> > qstat is just going to wait for a response from the server. Your
> > strace shows exactly that.
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
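
For the strace -T traces discussed in the quoted thread, whether taken of
qstat or of the server, the slow calls can be picked out mechanically
instead of by eye. strace -T appends the time spent in each system call as
"<seconds>" at the end of the line. The following is a rough sketch that
relies on that format only; the function name slowest_calls is invented
for illustration, and lines without a trailing timing are simply skipped.

```python
import re
import sys

# Matches the first "name(" on the line and the trailing "<seconds>"
# that strace -T appends to each completed system call.
_TIMED_CALL = re.compile(r"(\w+)\(.*\s<(\d+\.\d+)>\s*$")


def slowest_calls(lines, top=5):
    """Return up to `top` tuples (seconds, syscall, line_number), slowest first.

    Line numbers are 1-based so they match what an editor would show.
    """
    timed = []
    for number, line in enumerate(lines, start=1):
        match = _TIMED_CALL.search(line)
        if match:
            timed.append((float(match.group(2)), match.group(1), number))
    return sorted(timed, reverse=True)[:top]


if __name__ == "__main__":
    # Usage: python slowest.py <strace output file>
    with open(sys.argv[1]) as handle:
        for seconds, syscall, number in slowest_calls(handle):
            print(f"line {number}: {syscall} took {seconds:.6f}s")
```

Run against a trace of the server taken while a qstat is pending, this
should point at whichever call the server itself is blocked in, rather
than at the client's wait for the reply.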