[torqueusers] problem at 75 nodes

Garrick Staples garrick at usc.edu
Thu Jan 29 17:12:03 MST 2009


On Thu, Jan 29, 2009 at 10:43:09AM -0600, Tim Freeman alleged:
> We're launching clusters of different sizes using Torque on EC2, and at around
> 75 compute nodes we're seeing some issues.
> 
> Setup:
> 
> The /home directory is on NFS from an NFS server node, and the stdout and
> stderr of the job are redirected to local storage on each compute node (except
> for some preamble to the job, I am told, so there are some 8K stdout files
> that Torque handles).  Pretty straightforward.
> 
> 
> Two problems start flooding the logs:
> 
> PBS_Server;Req;?;req body bad, dis error 1 (Input value too large to convert to
> this type), type=LocateJob
> 
> PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request
> Protocol MSG=cannot decode message), aux=0, type=LocateJob, from torqueuser@
> 
> 
> Most of the log messages have torqueuser@NODE, but this just has "torqueuser@"
> 
> I don't find anything for these errors using search engines.  Any ideas?

The LocateJob DIS request is coming from outside of pbs_server.  While this is
happening, isolate the client that is triggering these messages and then we can
look at what is happening.  Note the client might be the scheduler.
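
One rough way to start isolating the client is to tally the rejected requests
by the "from" field that pbs_server logs. The sketch below is only an
illustration: it assumes each reject appears on a single line in the server log
(the quoted messages above are wrapped by the mail client) and that the line
ends with ", type=..., from user@host" as in the quoted sample. The function
name count_rejects is made up for this example, and the log file location is
site-specific and not stated in this thread.

```python
import re
from collections import Counter

# Matches the tail of a reject line as quoted in this thread, e.g.
#   "...req_reject;Reject reply code=15056(...), aux=0, type=LocateJob, from torqueuser@NODE"
# Assumption: the real log keeps this on one line; other fields are ignored.
REJECT_RE = re.compile(r"req_reject;.*type=(?P<type>\w+), from (?P<sender>\S*)")


def count_rejects(lines, req_type="LocateJob"):
    """Count rejected requests of one type, grouped by the logged sender."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match and match.group("type") == req_type:
            # The sender may have an empty host part ("torqueuser@"),
            # which is itself worth counting separately.
            counts[match.group("sender")] += 1
    return counts
```

Passing an open server log file to count_rejects() and looking at the largest
counts shows whether the bad requests cluster on a few senders or all carry the
empty host. If every entry is the bare "torqueuser@", the log alone will not
name the client, and the source has to be found some other way while the
messages are occurring.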

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

See the Prop 8 Dishonor Roll at http://www.californiansagainsthate.com/
