[torqueusers] node_check_script with node_check_interval=jobstart only executed on first node

David Beer dbeer at adaptivecomputing.com
Mon Dec 2 11:45:10 MST 2013


Thomas,

Can you create an issue report for this problem on GitHub?
https://github.com/adaptivecomputing/torque/issues


On Thu, Nov 28, 2013 at 9:11 AM, <thomas.zeiser at rrze.uni-erlangen.de> wrote:

> On Wed, Nov 27, 2013 at 12:49:01PM +0100, Thomas Zeiser wrote:
> > Hello,
> >
> > we have in our mom_priv/config
> >   $node_check_script /var/spool/torque/mom_priv/health-check.sh
> >   $node_check_interval 0,jobstart
> >
> > However, it looks like the health-check script is only executed on
> > the first node of a multi-node job, i.e. only on the node with the
> > Mother Superior.
>
> I forgot to mention: we observe this with both torque-2.5.12 and
> torque-4.2.5.
>
> > I would expect the jobstart-nodecheck to be run at (before)
> > jobstart on EVERY node of the job.
> >
> >
> > Moreover, there are some inconsistencies in the documentation:
> > 1) http://docs.adaptivecomputing.com/torque/Content/topics/commands/pbs_mom.htm
> >    "The message (up to 256 characters) immediately following the
> >    Error string"
> > 2) http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/creatingHealthCheckScript.htm
> >    "The message (up to 1024 characters) immediately following the
> >    ERROR keyword"
> > => "Error" vs. "ERROR"; length 256 vs. 1024 characters
> >
> >
> > Best,
> >
> > Thomas



-- 
David Beer | Senior Software Engineer
Adaptive Computing
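
For readers following the thread, a health-check script of the kind
referenced in the quoted mom_priv/config might look roughly like the
sketch below. It is not the script from the original post. It assumes,
per the documentation passages quoted above, that pbs_mom reads the
script's stdout and treats the keyword ERROR followed by a message as a
failed check. The check_dir function and the SCRATCH_DIR variable are
hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical health-check sketch; not taken from the thread.
# Assumption: pbs_mom reads this script's stdout and treats output
# beginning with the keyword ERROR as a failed check.

# Assumed example path; replace with whatever your site needs to verify.
SCRATCH_DIR="${SCRATCH_DIR:-/tmp}"

# Print an ERROR line if the given directory is missing or not writable.
# Prints nothing when the directory looks usable.
check_dir() {
    dir="$1"
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        # Keep the message short: the two documentation pages quoted
        # above disagree on the limit (256 vs. 1024 characters).
        echo "ERROR directory $dir is missing or not writable"
    fi
    return 0
}

check_dir "$SCRATCH_DIR"
```

Using the upper-case ERROR keyword and a short message sidesteps both
documentation inconsistencies raised above. Whether the script is run on
every node of a multi-node job at jobstart is exactly the open question
of this thread, so the sketch does not address that.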