[torqueusers] Slow response of torque when jobs are running
Luc.Vereecken at chem.kuleuven.be
Tue Dec 8 05:13:09 MST 2009
At 01:52 08/12/2009, Garrick Staples wrote:
>On Tue, Dec 08, 2009 at 01:39:38AM +0000, Luc Vereecken alleged:
> > Hi Chris,
> > I attach a strace -T output of qstat. The output looked like a normal
> > qstat output with jobnumbers and running times etc, so nothing special
> > there.
> > The strace reveals that it all goes awry when accessing the
> > /tmp/.torque-unix. Major time is lost on a poll (line 78) and a read
> > (line 90), all other times look like normal timings.
> > That reminds me that there is something like a no-unix-sockets option
> > in configure, iirc.
>What you want is an strace of the _server_ while doing a qstat.
>qstat is just going to wait for a response from the server. Your strace shows
I already wondered why Chris wanted a strace of qstat :-) but at
1:30am I wasn't wondering too much about anything anymore...
I did a "strace -tt -T" of the pbs_server process; the trace starts
about 30 seconds before the start of the qstat commands, as I wanted
to show some other output independent of the qstat command that might
be related. The trace is rather large (120K) so it probably shouldn't
get mailed to the entire list. It is available at
while the qstat strace (for comparison) is available as
These straces contain the time stamps to be able to crosslink them.
I annotated the external events (qstat start, etc) in the
strace.pbs_server file (search for "===").
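For anyone wanting to cross-link the two traces without reading them line by line, here is a minimal sketch that pulls the slow system calls out of "strace -tt -T" output together with their time stamps. It assumes the usual line layout (wall-clock time first, elapsed time in angle brackets last); the function name and the 1-second threshold are only illustrative choices.

```python
import re

# Matches lines as printed by "strace -tt -T", for example:
#   05:01:05.000000 select(8, [7], NULL, NULL, {5, 0}) = 1 (in [7]) <3.001234>
# An optional leading PID (as printed with -f) is tolerated.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<name>\w+)\(.*<(?P<dur>\d+\.\d+)>\s*$"
)

def slow_calls(lines, threshold=1.0):
    """Return (timestamp, syscall, seconds) for calls taking >= threshold."""
    found = []
    for line in lines:
        m = LINE_RE.match(line)
        # Annotation lines ("===") and unfinished calls simply do not match.
        if m and float(m.group("dur")) >= threshold:
            found.append((m.group("ts"), m.group("name"), float(m.group("dur"))))
    return found
```

Running it as `slow_calls(open("strace.pbs_server"))` and again on the qstat trace gives two short lists of time stamps that can be compared side by side.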
During normal operation of pbs_server, I regularly see messages with
"EAGAIN (Resource temporarily unavailable) <0.000003>" that might or
might not be related to the current problem.
While qstat is running, most of the time seems to be lost contacting
each of the nodes, with each select taking about 3 seconds. I
would have thought that pbs_server would just pass on its current
information from local memory, keeping its job information
current by visiting the nodes every so often during normal operation.
I have no exceptional delays accessing any of the compute nodes in
normal operation, i.e. submillisecond pings, fast ssh logins etc.
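To put a number on the claim that the nodes respond quickly, a rough sketch along these lines times a plain TCP connect from the server host to each node. The node names and the port are assumptions: 15002 is commonly the pbs_mom default, but that should be checked against the local configuration. A fast connect only rules out the network path; it says nothing about how quickly pbs_mom answers a request.

```python
import socket
import time

def connect_time(host, port, timeout=5.0):
    """Return seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Refused, timed out, or unresolvable host.
        return None

def probe(nodes, port):
    """Map each node name to its connect time (or None if unreachable)."""
    return {node: connect_time(node, port) for node in nodes}
```

For example, `probe(["node01", "node02"], 15002)` (hypothetical node names) returns a dictionary that can be compared with the roughly 3 seconds per select seen in the trace.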