[torqueusers] Re: maui performance problem
garrick at usc.edu
Wed Feb 9 21:06:45 MST 2005
On Wed, Feb 09, 2005 at 03:40:23PM -0800, Garrick Staples alleged:
> I have a user that has submitted 1000 jobs, most of which have 1000
> dependencies. This makes the job information very large. Maui, on each
> scheduler iteration, is "downloading" all of that information from pbs_server,
> and it's being very slow about it.
> This is the production environment, so I'm not up to the latest revs yet:
> strace shows pbs_server is doing very large writes, and maui is spinning its
> wheels doing *single byte* reads! This process takes about 2 minutes, and
> pbs_server and maui are both unresponsive during this time.
After searching through the code, I realize now that maui just uses torque's
pbs_ifl library to retrieve node/job information.

I think the problem is in pbs_disconnect(). If there are bytes on the wire,
pbs_disconnect() will loop over 'read(fd,&x,1)' before actually closing the
socket. If there happen to be a lot of bytes on the wire, this can be really
slow.

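To make the cost concrete, here is a minimal sketch in C of the drain pattern described above, next to a buffered variant. This is not the pbs_ifl source: the function names drain_bytewise() and drain_buffered() and the 64 KB buffer size are hypothetical, chosen only for illustration. Both helpers read until EOF or an error and report how many read() calls they needed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not TORQUE code: drain a descriptor one byte per
 * read() call. Returns the number of read() calls made and stores the
 * number of bytes consumed in *drained. */
static long drain_bytewise(int fd, size_t *drained)
{
    char x;
    long calls = 0;

    assert(fd >= 0 && drained != NULL);
    *drained = 0;
    for (;;) {
        ssize_t n = read(fd, &x, 1);
        calls++;
        if (n > 0) {
            (*drained)++;
        } else if (n < 0 && errno == EINTR) {
            continue;   /* interrupted by a signal: retry */
        } else {
            break;      /* 0 means EOF; <0 is an error such as EAGAIN */
        }
    }
    return calls;
}

/* Hypothetical alternative: same contract, but each read() may consume
 * up to sizeof(buf) bytes, so far fewer system calls are needed. */
static long drain_buffered(int fd, size_t *drained)
{
    char buf[65536];    /* assumed size, not taken from TORQUE */
    long calls = 0;

    assert(fd >= 0 && drained != NULL);
    *drained = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        calls++;
        if (n > 0) {
            *drained += (size_t)n;
        } else if (n < 0 && errno == EINTR) {
            continue;   /* interrupted by a signal: retry */
        } else {
            break;      /* 0 means EOF; <0 is an error such as EAGAIN */
        }
    }
    return calls;
}
```

With N bytes pending and then EOF, drain_bytewise() needs N+1 system calls, while drain_buffered() needs roughly N/65536 plus one. Whether draining is needed at all before close() is a separate question that this sketch does not try to answer.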
Every scheduling iteration, maui disconnects and reconnects to pbs_server. If
a timeout occurs in the non-blocking read() deep down inside of, say,
pbs_statnode(), then maui ignores it and will disconnect at the beginning of
the next scheduling iteration.

However, what happens if pbs_server got the request, fulfilled it, and sent off
that data? I think pbs_server ends up blocking while it waits for maui to read
those bytes, one at a time. On a large cluster (1364 nodes and ~2000 jobs) this
can take a while.

Does this make any sense?
Garrick Staples, Linux/HPCC Administrator
University of Southern California