[torquedev] Should a communication error between pbs_mom's kill a job ?

Michael Barnes barnes at jlab.org
Mon May 4 09:37:46 MDT 2009


On Mon, May 04, 2009 at 10:35:37PM +1000, Chris Samuel wrote:
> We're starting to get complaints from a cluster we're
> helping out on where spurious communication faults between
> pbs_mom's are causing jobs to get killed off, even though
> they are continuing to function properly.  For example:
> 
> =>> PBS: job killed: node 11 (shrek026) requested job terminate,
> 'EOF' (code 1099) - internal or network failure attempting to
> communicate with sister MOM's
> 
> This usually seems to happen when nodes are under high
> I/O load (mostly NFS related) and doesn't appear to be
> related to any actual issues on the nodes.
> 
> To me the simplest solution for this case appears to be
> to just return 0 from the job_over_limit() function when
> it's in that condition, rather than letting it fall through
> to the return(1).

Whenever I install PBS/TORQUE, I modify the code so that a mom cannot
kill a job for this reason. I cannot think of a reason why one mom
should terminate a job unless the job has actually gone over its
resource limits, as the function name implies.
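
For anyone following along, here is a rough, self-contained sketch of the
idea, not the actual TORQUE source. The struct, its fields, the function
name job_over_limit_sketch() and the kill_on_comm_error flag are all
made up for illustration; only the "return 0 instead of falling through
to return(1)" behaviour comes from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified job record; NOT TORQUE's real job struct. */
struct job_sketch {
    bool sister_comm_error; /* a sister MOM reported EOF / network failure */
    long mem_used_kb;       /* current usage (illustrative resource) */
    long mem_limit_kb;      /* 0 means "no limit set" */
};

/*
 * Returns 1 if the job should be killed, 0 otherwise.
 *
 * kill_on_comm_error is an assumed switch used only to contrast the
 * two behaviours: true mimics "fall through to return(1)", false is
 * the proposed "return 0 in that condition".
 */
int job_over_limit_sketch(const struct job_sketch *pjob,
                          bool kill_on_comm_error)
{
    assert(pjob != NULL);

    if (pjob->sister_comm_error) {
        /* Communication fault between MOMs: with the proposed change
         * this is no longer treated as a reason to kill the job. */
        return kill_on_comm_error ? 1 : 0;
    }

    /* Genuine resource-limit check, which is what the function name
     * suggests it should be doing. */
    if (pjob->mem_limit_kb > 0 && pjob->mem_used_kb > pjob->mem_limit_kb)
        return 1;

    return 0;
}
```

Note that in this sketch a job with a communication fault is not checked
against its limits on that pass; whether that is acceptable depends on
how often the real function is re-run, which I have not verified.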

-mb

-- 
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------

