[torqueusers] Re: problem at 75 nodes

Tim Freeman tfreeman at mcs.anl.gov
Fri Jan 30 09:52:18 MST 2009


On Thu, 29 Jan 2009 18:12:03 -0700
Garrick Staples <garrick at usc.edu> wrote:

> On Thu, Jan 29, 2009 at 10:43:09AM -0600, Tim Freeman alleged:  
> > We're launching clusters of different sizes using Torque on EC2, and at
> > around 75 compute nodes we are seeing some issues.
> > 
> > Setup:
> > 
> > The /home directory is on NFS from an NFS server node and the stdout and
> > stderr of the job are redirected to local storage on each compute node
> > (except for some preamble to the job, I am told, so there are some 8K
> > stdout files that Torque handles).  Pretty straightforward.
> > 
> > 
> > Two problems start flooding the logs:
> > 
> > PBS_Server;Req;?;req body bad, dis error 1 (Input value too large to
> > convert to this type), type=LocateJob
> > 
> > PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request
> > Protocol MSG=cannot decode message), aux=0, type=LocateJob, from torqueuser@
> > 
> > 
> > Most of the log messages have torqueuser@NODE, but this one just has
> > "torqueuser@".
> > 
> > I don't find anything for these errors using search engines.  Any ideas?  
> 
> The LocateJob DIS request is coming from outside of pbs_server.  While this is
> happening, isolate the client that is triggering these messages and then we
> can look at what is happening.  Note the client might be the scheduler.
>   

I was able to isolate this to a non-Torque-related client in the 'grid' stack
(sigh).
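
For anyone else chasing these messages, here is a minimal sketch of one way to
tally the rejected requests by their "from" field, which may help narrow down
the offending client. It is not what was used in this case. The regular
expression assumes the reject lines look like the excerpt quoted above, and the
names count_rejects and REJECT_RE are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape, taken only from the excerpt quoted in this thread:
#   ...;req_reject;Reject reply code=15056(...), aux=0, type=LocateJob, from user@host
REJECT_RE = re.compile(r"req_reject;.*type=(?P<type>\w+), from (?P<sender>\S*)")


def count_rejects(lines, req_type="LocateJob"):
    """Count rejected requests of the given type, grouped by the 'from' field."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match and match.group("type") == req_type:
            # An empty sender is kept visible instead of being dropped.
            counts[match.group("sender") or "(empty)"] += 1
    return counts
```

Passing an open pbs_server log file (or any iterable of log lines) to
count_rejects returns a Counter; calling most_common() on it lists the senders
with the most rejected LocateJob requests first. A sender with no host part,
such as "torqueuser@", shows up as its own entry.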

Thank you,
Tim

