[torqueusers] Re: problem at 75 nodes

Tim Freeman tfreeman at mcs.anl.gov
Thu Jan 29 09:45:02 MST 2009


On Thu, 29 Jan 2009 10:43:09 -0600
Tim Freeman <tfreeman at mcs.anl.gov> wrote:

> We're launching clusters of different sizes using Torque on EC2, and at around
> 75 compute nodes we start seeing some issues.
> 
> Setup:
> 
> The /home directory is on NFS from an NFS server node, and the stdout and
> stderr of the job are redirected to local storage on each compute node (except
> for some preamble to the job, I am told, so there are some 8K stdout files that
> Torque handles).

I forgot to mention the version: this is Torque 2.1.8, sorry.

Thanks,
Tim

> Pretty straightforward.
> 
> 
> Two problems start flooding the logs:
> 
> PBS_Server;Req;?;req body bad, dis error 1 (Input value too large to convert
> to this type), type=LocateJob
> 
> PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request
> Protocol MSG=cannot decode message), aux=0, type=LocateJob, from torqueuser@
> 
> 
> Most of the log messages have torqueuser@NODE, but this one has just "torqueuser@"
> 
> I can't find anything about these errors using search engines. Any ideas?
> 
> Thanks,
> Tim
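
To narrow down where the rejected requests come from, something like the following could tally them per sending host. This is only a rough sketch: the regular expression is guessed from the two messages quoted above (each taken as a single unwrapped log line), the name tally_rejects is made up for illustration, and real server log lines may carry extra fields that need a different pattern.

```python
import re
from collections import Counter

# Assumed shape of a reject line, based only on the messages quoted above:
#   ...req_reject;Reject reply code=..., aux=0, type=LocateJob, from torqueuser@HOST
REJECT_RE = re.compile(
    r"req_reject;.*type=(?P<type>\w+), from (?P<user>[^@\s]+)@(?P<host>\S*)"
)


def tally_rejects(lines):
    """Count rejected requests per (request type, sending host).

    A line whose host part is empty (just "user@") is counted under "<empty>".
    """
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            host = match.group("host") or "<empty>"
            counts[(match.group("type"), host)] += 1
    return counts
```

Calling tally_rejects(open(path)) on a server log and printing counts.most_common() would show whether the empty-host rejects dominate or whether particular nodes stand out.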

