[torqueusers] Re: maui performance problem

Garrick Staples garrick at usc.edu
Wed Feb 9 21:06:45 MST 2005


On Wed, Feb 09, 2005 at 03:40:23PM -0800, Garrick Staples alleged:
> I have a user who has submitted 1000 jobs, most of them with 1000
> dependencies.  This makes the job information very large.  On each
> scheduler iteration, maui is "downloading" all of that information from
> pbs_server, and it's being very slow about it.
> 
> This is the production environment, so I'm not up to the latest revs yet:
> torque-1.1.0p4-snap.1099003850
> maui-3.2.6p10-snap.1095450030
> 
> strace shows pbs_server is doing very large writes, and maui is spinning its
> wheels doing *single byte* reads!  This process takes about 2 minutes and
> pbs_server and maui are both unresponsive during this time.

After searching through the code, I realize now that maui just uses torque's
pbs_ifl library to retrieve node/job information.

I think the problem is in pbs_disconnect().  If there are unread bytes on the
wire, pbs_disconnect() will loop over 'read(fd,&x,1)' before actually closing
the socket.  If there happen to be a lot of bytes on the wire, this can be
really slow, since every single byte costs one system call.
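
To make the cost concrete, here is a minimal sketch (not the actual pbs_ifl
code) that compares draining a descriptor one byte per read() with draining it
through a larger buffer.  The helper names drain_bytewise() and
drain_buffered() and the 64 KB buffer size are made up for illustration; both
loops assume the writer eventually closes its end so that read() returns 0.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: drain fd with one read() per byte, the pattern
 * seen in strace.  Returns the bytes drained and stores the number of
 * read() calls made in *calls. */
static long drain_bytewise(int fd, long *calls)
{
    char x;
    long total = 0;

    assert(fd >= 0);
    *calls = 0;
    for (;;) {
        ssize_t n = read(fd, &x, 1);
        (*calls)++;
        if (n == 1) { total++; continue; }
        if (n < 0 && errno == EINTR) continue;  /* interrupted, retry */
        break;                                  /* EOF or real error */
    }
    return total;
}

/* Hypothetical alternative: drain fd through a 64 KB buffer, so the
 * number of system calls no longer grows with every byte. */
static long drain_buffered(int fd, long *calls)
{
    char buf[65536];
    long total = 0;

    assert(fd >= 0);
    *calls = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        (*calls)++;
        if (n > 0) { total += n; continue; }
        if (n < 0 && errno == EINTR) continue;  /* interrupted, retry */
        break;                                  /* EOF or real error */
    }
    return total;
}
```

With N bytes pending, the first loop makes N+1 read() calls and the second
makes only a handful, which would be consistent with the strace output if the
pending data is several megabytes.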

Every scheduling iteration, maui disconnects and reconnects to pbs_server.  If
a timeout occurs in the non-blocking read() deep down inside of, say,
pbs_statnode(), then maui ignores it and will disconnect at the beginning of
the next scheduling iteration.  

However, what happens if pbs_server got the request, fulfilled it, and sent off
that data?  I think pbs_server ends up blocked in its write, waiting for maui
to read those bytes one at a time.  On a large cluster (1364 nodes and ~2000
jobs) this can take a few minutes.
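
The general mechanism I have in mind can be sketched as follows.  This is not
TORQUE code; it uses a local socket instead of the real TCP connection, and
the helper name fill_until_full() is made up.  It writes to a non-blocking
socket until the kernel refuses more data, which is the point where a blocking
writer would stall until the peer reads something.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode and write until the
 * kernel buffer is full (EAGAIN/EWOULDBLOCK).  Returns the number of bytes
 * the kernel accepted, or -1 on any other error. */
static long fill_until_full(int fd)
{
    char chunk[4096];
    long total = 0;
    int flags;

    assert(fd >= 0);
    memset(chunk, 'x', sizeof chunk);

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    for (;;) {
        ssize_t n = write(fd, chunk, sizeof chunk);
        if (n > 0) { total += n; continue; }
        if (n < 0 && errno == EINTR) continue;   /* interrupted, retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;   /* a blocking writer would stall here */
        return -1;
    }
}
```

Calling this on one end of a socketpair(AF_UNIX, SOCK_STREAM, 0, sv) while
nobody reads the other end returns after a bounded number of bytes.  Further
writes only succeed once the reader has consumed data, so a reader that
consumes one byte per system call also throttles the writer.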

Does this make any sense?

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California