[torqueusers] node_check_script with node_check_interval=jobstart only executed on first node

David Beer dbeer at adaptivecomputing.com
Mon Dec 2 12:15:41 MST 2013


Thanks Thomas.


On Mon, Dec 2, 2013 at 12:06 PM, Thomas Zeiser <thomas.zeiser at rrze.uni-erlangen.de> wrote:

> On Mon, Dec 02, 2013 at 11:45:10AM -0700, David Beer wrote:
> > Thomas,
> >
> > Can you create an issue report for this problem on github?
> > https://github.com/adaptivecomputing/torque/issues
>
> done; https://github.com/adaptivecomputing/torque/issues/203
>
> > On Thu, Nov 28, 2013 at 9:11 AM, <thomas.zeiser at rrze.uni-erlangen.de> wrote:
> >
> > > On Wed, Nov 27, 2013 at 12:49:01PM +0100, Thomas Zeiser wrote:
> > > > Hello,
> > > >
> > > > we have in our mom_priv/config
> > > >   $node_check_script /var/spool/torque/mom_priv/health-check.sh
> > > >   $node_check_interval 0,jobstart
> > > >
> > > > However, it looks like the health-check script is only executed on
> > > > the first node of a multi-node job, i.e. only on the node with the
> > > > Mother Superior.
> > >
> > > I forgot to mention that we are observing this with both
> > > torque-2.5.12 and torque-4.2.5.
> > >
> > > > I would expect the jobstart-nodecheck to be run at (before)
> > > > jobstart on EVERY node of the job.
> > > >
> > > >
> > > > Moreover, there are some inconsistencies in the documentation:
> > > > 1) http://docs.adaptivecomputing.com/torque/Content/topics/commands/pbs_mom.htm
> > > >    "The message (up to 256 characters) immediately following the
> > > >    Error string"
> > > > 2) http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/creatingHealthCheckScript.htm
> > > >    "The message (up to 1024 characters) immediately following the
> > > >    ERROR keyword"
> > > > => "Error" vs. "ERROR"; length 256 vs. 1024 characters
> > > >
> > > >
> > > > Best,
> > > >
> > > > Thomas
>
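For reference, here is a minimal sketch of the kind of health check script discussed above. It assumes only what the quoted documentation states: the script reports a problem by printing the ERROR keyword followed by a message. The function names, the HEALTH_CHECK_DIR variable and the directory test are hypothetical placeholders, not part of Torque. Because the two documentation pages disagree on the message length, the sketch truncates to the smaller value (256).

```shell
#!/bin/sh
# Hypothetical health check sketch; the check itself is illustrative only.

# Conservative cap: the two documentation pages cited above disagree
# (256 vs. 1024 characters), so truncate to the smaller value.
MAX_MSG_LEN=256

# Print "ERROR <message>" on stdout, with the message truncated
# to at most MAX_MSG_LEN characters.
report_error() {
    printf 'ERROR %s\n' "$(printf '%s' "$1" | cut -c1-"$MAX_MSG_LEN")"
}

# Report an error if the given directory is missing or not writable.
check_writable_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        report_error "$dir does not exist"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        report_error "$dir is not writable"
        return 1
    fi
    return 0
}

# HEALTH_CHECK_DIR is a made-up variable for this sketch; /tmp is the default.
# The problem is reported through stdout only, so the exit status is kept at 0
# here (an assumption, not documented behaviour).
check_writable_dir "${HEALTH_CHECK_DIR:-/tmp}" || true
```

Running the script by hand on each node of a job is a simple way to compare its output with what the MOM reports for the first node only.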



-- 
David Beer | Senior Software Engineer
Adaptive Computing