problem at 75 nodes

Tim Freeman tfreeman at mcs.anl.gov
Thu Jan 29 09:43:09 MST 2009

We're launching clusters of different sizes using Torque on EC2, and at around
75 compute nodes we start seeing some issues.


The /home directory is NFS-mounted from an NFS server node, and the stdout and
stderr of each job are redirected to local storage on each compute node (except,
I am told, for some preamble to the job, so there are some 8K stdout files that
Torque handles).  Pretty straightforward.

Two errors start flooding the logs:

PBS_Server;Req;?;req body bad, dis error 1 (Input value too large to convert to
this type), type=LocateJob

PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request
Protocol MSG=cannot decode message), aux=0, type=LocateJob, from torqueuser@

Most of the log messages have "torqueuser@NODE", but this one has just
"torqueuser@" with no host name after it.
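
To check whether the rejects with the empty host are a small fraction or the
bulk of the flood, something like the following Python sketch could tally them.
It assumes each log entry sits on a single line and follows the format of the
excerpts above; the names REJECT_RE and tally_rejects are made up for this
illustration and are not part of Torque.

```python
import re
from collections import Counter

# Assumption: entries look like the excerpt above, one per line, e.g.
#   ...;Reject reply code=15056(...), aux=0, type=LocateJob, from torqueuser@HOST
REJECT_RE = re.compile(
    r"Reject reply code=(?P<code>\d+).*?type=(?P<type>\w+), "
    r"from (?P<user>[^@\s]+)@(?P<host>\S*)"
)


def tally_rejects(lines):
    """Count reject entries per (reply code, request type, origin host)."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match is None:
            continue  # not a reject entry in the assumed format
        # An empty host is the odd "torqueuser@" case; label it explicitly.
        host = match.group("host") or "<empty>"
        counts[(match.group("code"), match.group("type"), host)] += 1
    return counts
```

Passing it an open server log file (for example tally_rejects(open(path)) with
path pointing at the log in question) and printing counts.most_common() would
show which hosts, if any, the rejected LocateJob requests come from.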

I can't find anything about these errors using search engines.  Any ideas?


