[torqueusers] node_check_script with node_check_interval=jobstart only executed on first node
dbeer at adaptivecomputing.com
Mon Dec 2 11:45:10 MST 2013
Can you create an issue report for this problem on GitHub?
On Thu, Nov 28, 2013 at 9:11 AM, <thomas.zeiser at rrze.uni-erlangen.de> wrote:
> On Wed, Nov 27, 2013 at 12:49:01PM +0100, Thomas Zeiser wrote:
> > Hello,
> > we have in our mom_priv/config
> > $node_check_script /var/spool/torque/mom_priv/health-check.sh
> > $node_check_interval 0,jobstart
> > However, it looks like the health-check script is only executed on
> > the first node of a multi-node job, i.e. only on the node with the
> > Mother Superior.
> I forgot to mention: we are observing this with both
> torque-2.5.12 and torque-4.2.5.
> > I would expect the jobstart-nodecheck to be run at (before)
> > jobstart on EVERY node of the job.
> > Moreover, there are some inconsistencies in the documentation:
> > 1)
> > "The message (up to 256 characters) immediately following the
> > Error string"
> > 2)
> > "The message (up to 1024 characters) immediately following the
> > ERROR keyword"
> > => "Error" vs. "ERROR"; length 256 vs. 1024 characters
> > Best,
> > Thomas
David Beer | Senior Software Engineer
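
For readers unfamiliar with the setup discussed in the quoted message, a health check script of the kind named in $node_check_script might look roughly like the sketch below. It assumes pbs_mom treats an output line beginning with the ERROR keyword as a failure report, as one of the quoted documentation passages describes. The function name report_health, the /tmp check, and the 1 GB threshold are hypothetical and do not come from the original post.

```shell
#!/bin/sh
# Hypothetical sketch of a health-check.sh; the check and threshold are
# illustrative assumptions, not the script used by the original poster.

# report_health DIR MIN_KB
# Prints a line starting with ERROR if DIR has less than MIN_KB kilobytes
# available; prints nothing if the check passes.
report_health() {
    dir=$1
    min_kb=$2
    # df -Pk prints one POSIX-format line per filesystem in 1 KB units;
    # column 4 of the second line is the available space.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ "$free_kb" -lt "$min_kb" ]; then
        echo "ERROR less than ${min_kb} KB free in ${dir}"
    fi
    return 0
}

# Assumed check: require at least 1 GB (1048576 KB) free in /tmp.
report_health /tmp 1048576
```

Because the report is written to standard output, running the script by hand on each node of a multi-node job is one way to see what it would report on nodes other than the Mother Superior.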