[torqueusers] Some nodes not returning .o and .e files

David Beer dbeer at adaptivecomputing.com
Fri Dec 3 15:11:33 MST 2010



----- Original Message -----
> Hi All,
> 
> We're just in the final stages of commissioning our new cluster and
> have come across a strange behaviour.
> 
> Because we often have a very large number of short-running (up to 10
> hours) jobs, our login nodes are set up to also receive jobs from the
> queueing system that can run and complete overnight (between 6pm and
> 8am).
> 
> On all our compute nodes, when jobs finish, the .o and .e files
> created by Torque are returned to the user as expected, but on the
> login nodes, they're not. They are created as usual, but just end up
> in the undelivered directory under /usr/spool/PBS.
> 
> These nodes are using the same config as the others, so the
> filesystems identified in usecp are the same. These filesystems are
> mounted across the cluster via NFS from the same sources.
> 
> Apart from the time restriction, the only other difference we can see
> between these and the compute nodes is that these are also the submit
> hosts for jobs. As far as we can tell, the config is the same as the
> one on our existing cluster, where everything works as expected.
> 
> Can anyone shed some light on where we might look next?
> 
> Thanks,
> Andrew
> 
> 

Andrew,

The first place I would look is the mom log files on the login nodes. If a copy failed, or anything else went wrong, there should be a message about it there. Try grepping for the job id.
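
If grep is awkward across many log files, the same search can be sketched in a few lines of Python. This is only a rough illustration: it assumes the mom logs sit in a mom_logs directory under the spool directory (/usr/spool/PBS/mom_logs is a guess based on the path mentioned above), the job id 1234.headnode is a placeholder, and find_job_lines is a hypothetical helper, not part of Torque.

```python
from pathlib import Path


def find_job_lines(log_dir, job_id):
    """Return (file name, line) pairs for every log line that mentions job_id."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        # Nothing to search; the directory path is an assumption anyway.
        return []
    matches = []
    for log_file in sorted(log_dir.iterdir()):
        if not log_file.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with log_file.open(errors="replace") as handle:
            for line in handle:
                if job_id in line:
                    matches.append((log_file.name, line.rstrip("\n")))
    return matches


if __name__ == "__main__":
    # Both values are placeholders: substitute your own log directory and job id.
    for name, line in find_job_lines("/usr/spool/PBS/mom_logs", "1234.headnode"):
        print(f"{name}: {line}")
```

Run it on one of the login nodes for a job whose output ended up in the undelivered directory, then read the matching lines for anything that mentions a failed copy.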

Cheers,

David


