[torqueusers] Re: problem at 75 nodes
Tim Freeman
tfreeman at mcs.anl.gov
Fri Jan 30 09:52:18 MST 2009
On Thu, 29 Jan 2009 18:12:03 -0700
Garrick Staples <garrick at usc.edu> wrote:
> On Thu, Jan 29, 2009 at 10:43:09AM -0600, Tim Freeman alleged:
> > We're launching clusters of different sizes using Torque on EC2, and at
> > around 75 compute nodes we are seeing some issues.
> >
> > Setup:
> >
> > The /home directory is NFS-mounted from an NFS server node, and the stdout
> > and stderr of the job are redirected to local storage on each compute node
> > (except for some preamble to the job, I am told, so there are some 8K
> > stdout files that Torque handles). Pretty straightforward.
> >
> >
> > Two problems start flooding the logs:
> >
> > PBS_Server;Req;?;req body bad, dis error 1 (Input value too large to
> > convert to this type), type=LocateJob
> >
> > PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request
> > Protocol MSG=cannot decode message), aux=0, type=LocateJob, from torqueuser@
> >
> >
> > Most of the log messages have torqueuser@NODE, but this one has just
> > "torqueuser@"
> >
> > I can't find anything for these errors using search engines. Any ideas?
>
> The LocateJob DIS request is coming from outside of pbs_server. While this is
> happening, isolate the client that is triggering these messages and then we
> can look at what is happening. Note the client might be the scheduler.
>
I was able to isolate this to a non-Torque-related client in the 'grid' stack
(sigh).
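
For anyone who lands on this thread with the same messages, here is a minimal
sketch of one way to start narrowing down the sender: tally the rejected
requests by request type, user, and host. It is only an illustration, not the
method used above. The regular expression is modelled solely on the
req_reject message quoted earlier, it assumes each log entry sits on a single
line in the real log, and the name tally_rejects and the sample lines are
hypothetical.

```python
import re
from collections import Counter

# Matches the tail of the reject message quoted in this thread, e.g.
#   "... aux=0, type=LocateJob, from torqueuser@"
# The host after '@' may be empty, as it was in the problem messages.
REJECT_RE = re.compile(
    r"type=(?P<type>\w+), from (?P<user>[^@\s]*)@(?P<host>\S*)"
)


def tally_rejects(lines):
    """Count (request type, user, host) triples among req_reject lines."""
    counts = Counter()
    for line in lines:
        if "req_reject" not in line:
            continue  # only rejected requests are of interest here
        m = REJECT_RE.search(line)
        if m:
            host = m["host"] or "<no host>"
            counts[(m["type"], m["user"], host)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample input; in practice, pass an open log file instead.
    sample = [
        "PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based "
        "Request Protocol MSG=cannot decode message), aux=0, "
        "type=LocateJob, from torqueuser@",
    ]
    for (req_type, user, host), n in tally_rejects(sample).most_common():
        print(n, req_type, user, host)
```

A tally like this only shows which request types and senders dominate the
rejects. When the host is empty, as here, the offending client still has to
be tracked down by other means, for example by stopping suspected clients one
at a time while watching the log.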
Thank you,
Tim