[torqueusers] node_check_script with node_check_interval=jobstart only executed on first node

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Mon Dec 2 12:06:37 MST 2013


On Mon, Dec 02, 2013 at 11:45:10AM -0700, David Beer wrote:
> Thomas,
> 
> Can you create an issue report for this problem on github?
> https://github.com/adaptivecomputing/torque/issues

done; https://github.com/adaptivecomputing/torque/issues/203

> On Thu, Nov 28, 2013 at 9:11 AM, <thomas.zeiser at rrze.uni-erlangen.de> wrote:
> 
> > On Wed, Nov 27, 2013 at 12:49:01PM +0100, Thomas Zeiser wrote:
> > > Hello,
> > >
> > > we have in our mom_priv/config
> > >   $node_check_script /var/spool/torque/mom_priv/health-check.sh
> > >   $node_check_interval 0,jobstart
> > >
> > > However, it looks like the health-check script is only executed on
> > > the first node of a multi-node job, i.e. only on the node with the
> > > Mother Superior.
> >
> > I forgot to mention that we are observing this with both
> > torque-2.5.12 and torque-4.2.5.
> >
> > > I would expect the jobstart-nodecheck to be run at (before)
> > > jobstart on EVERY node of the job.
> > >
> > >
> > > Moreover, there are some inconsistencies in the documentation:
> > > 1) http://docs.adaptivecomputing.com/torque/Content/topics/commands/pbs_mom.htm
> > >    "The message (up to 256 characters) immediately following the
> > >    Error string"
> > > 2) http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/creatingHealthCheckScript.htm
> > >    "The message (up to 1024 characters) immediately following the
> > >    ERROR keyword"
> > > => "Error" vs. "ERROR"; length 256 vs. 1024 characters
> > >
> > >
> > > Best,
> > >
> > > Thomas
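
For reference, a minimal sketch of what a script like the
health-check.sh named in the config above might look like. The check
itself (a writable scratch directory), the function name check_scratch
and the SCRATCH_DIR variable are assumptions for illustration only and
are not taken from our actual script. Because the two documentation
pages disagree, the sketch uses the upper-case ERROR keyword and keeps
the message well below 256 characters, so it should fit either
description.

```shell
#!/bin/sh
# Hypothetical node health check (illustration only, not the script from this thread).
# Convention quoted from the docs: report a failure on stdout as the keyword
# ERROR followed by a message.

# check_scratch DIR: print "ERROR <message>" and return 1 if DIR is missing
# or not writable; print nothing and return 0 otherwise.
check_scratch() {
    dir="$1"
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        # Upper-case keyword and a short message, to satisfy both doc pages.
        echo "ERROR scratch directory $dir is not writable"
        return 1
    fi
    return 0
}

# SCRATCH_DIR is an assumed, site-specific setting; /tmp is only a fallback.
check_scratch "${SCRATCH_DIR:-/tmp}"
```

Running it by hand on each node of a job is a quick way to see what
pbs_mom would get back from the script on that node.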

