[torqueusers] problem at 75 nodes
garrick at usc.edu
Thu Jan 29 17:12:03 MST 2009
On Thu, Jan 29, 2009 at 10:43:09AM -0600, Tim Freeman alleged:
> We're launching clusters of different sizes using Torque on EC2, and at around
> 75 compute nodes we are seeing some issues.
> The /home directory is on NFS from an NFS server node and the stdout and stderr
> of the job are redirected to local storage on each compute node (except for some
> preamble to the job I am told, so there are some 8K stdout files that Torque
> handles). Pretty straightforward.
> Two problems start flooding the logs:
> PBS_Server;Req;?;req body bad, dis error 1 (Input value too large to convert to
> this type), type=LocateJob
> PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request
> Protocol MSG=cannot decode message), aux=0, type=LocateJob, from torqueuser@
> Most of the log messages have torqueuser@NODE but this just has "torqueuser@"
> I don't find anything for these errors using search engines. Any ideas?
The LocateJob DIS request is coming from outside of pbs_server. While this is
happening, isolate the client that is triggering these messages and then we can
look at what is happening. Note the client might be the scheduler.
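One way to start narrowing it down is to tally the rejected requests by the origin the server logged for them. The following is only a rough sketch: it assumes each reject entry sits on a single line in the server log and ends with "type=<T>, from <user>@<host>" as in the lines quoted above. The names REJECT_RE and tally_rejects are made up for this example and are not part of Torque.

```python
import re
from collections import Counter

# Assumed layout, taken from the quoted log lines:
#   ...;req_reject;Reject reply code=...(...), aux=0, type=LocateJob, from user@host
REJECT_RE = re.compile(
    r"req_reject;.*type=(?P<type>\w+), from (?P<user>[^@\s]*)@(?P<host>\S*)"
)


def tally_rejects(lines):
    """Count rejected requests per (request type, user, host)."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            # The problem entries carry no host after the "@"; keep them visible.
            host = match.group("host") or "<empty>"
            counts[(match.group("type"), match.group("user"), host)] += 1
    return counts
```

Feeding it the open server log file (for example `tally_rejects(open(path))`, where `path` points at the current file under the server_logs directory of your Torque spool) gives a per-origin count. Entries with an empty host will not name a machine by themselves, so it may still be necessary to stop the scheduler or individual clients one at a time and watch whether the count keeps growing.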
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
See the Prop 8 Dishonor Roll at http://www.californiansagainsthate.com/