[torqueusers] Re: problem at 75 nodes
Tim Freeman
tfreeman at mcs.anl.gov
Thu Jan 29 09:45:02 MST 2009
On Thu, 29 Jan 2009 10:43:09 -0600
Tim Freeman <tfreeman at mcs.anl.gov> wrote:
> We're launching clusters of different sizes using Torque on EC2, and at
> around 75 compute nodes we start seeing some issues.
>
> Setup:
>
> The /home directory is served over NFS from an NFS server node, and the
> stdout and stderr of the job are redirected to local storage on each compute
> node (except for some preamble to the job, I am told, so there are some 8K
> stdout files that Torque handles).
Sorry, I forgot to mention the version: this is Torque 2.1.8.
Thanks,
Tim
> Pretty straightforward.
>
>
> Two problems start flooding the logs:
>
> PBS_Server;Req;?;req body bad, dis error 1 (Input value too large to convert
> to this type), type=LocateJob
>
> PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request
> Protocol MSG=cannot decode message), aux=0, type=LocateJob, from torqueuser@
>
>
> Most of the log messages have torqueuser@NODE, but this one just has
> "torqueuser@" with no host after it.
>
> I can't find anything for these errors using search engines. Any ideas?
>
> Thanks,
> Tim
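
In case it helps anyone looking at the same symptoms, here is a rough Python sketch for tallying the reject messages quoted above by request type and by the host part of the sender. It assumes each record sits on a single line in the server log (the wrapping above comes from the mail client) and matches only the text shown in this message. The names REJECT_RE and tally_rejects are made up for illustration and are not part of Torque.

```python
import re
from collections import Counter

# Matches the "req_reject" records quoted above. Assumption: one record per
# line, ending in "type=<RequestType>, from <user>@<host>" (host may be empty).
REJECT_RE = re.compile(r"req_reject;.*?type=(?P<type>\w+), from (?P<sender>\S*)")


def tally_rejects(lines):
    """Count rejected requests keyed by (request type, sender host)."""
    counts = Counter()
    for line in lines:
        m = REJECT_RE.search(line)
        if m is None:
            continue  # not a reject record
        # Split "user@host"; an empty host is the odd case described above.
        _user, _, host = m.group("sender").partition("@")
        counts[(m.group("type"), host or "<no host>")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally_rejects.py < server_log_file
    for (req_type, host), n in tally_rejects(sys.stdin).most_common():
        print(f"{n:6d}  {req_type}  {host}")
```

If the rejects with an empty host cluster around the time the cluster grows past a certain size, that would at least narrow down when the malformed requests start.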