[torqueusers] node_check_script with node_check_interval=jobstart only executed on first node

thomas.zeiser at rrze.uni-erlangen.de
Thu Nov 28 09:11:35 MST 2013


On Wed, Nov 27, 2013 at 12:49:01PM +0100, Thomas Zeiser wrote:
> Hello,
> 
> we have in our mom_priv/config
>   $node_check_script /var/spool/torque/mom_priv/health-check.sh
>   $node_check_interval 0,jobstart
>
> However, it looks like the health-check script is only executed on
> the first node of a multi-node job, i.e. only on the node with the
> Mother Superior.

I forgot to mention: we observe this with both torque-2.5.12 and
torque-4.2.5.

> I would expect the jobstart-nodecheck to be run at (before)
> jobstart on EVERY node of the job.
> 
> 
> Moreover, there are some inconsistencies in the documentation:
> 1) http://docs.adaptivecomputing.com/torque/Content/topics/commands/pbs_mom.htm
>    "The message (up to 256 characters) immediately following the
>    Error string"
> 2) http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/creatingHealthCheckScript.htm
>    "The message (up to 1024 characters) immediately following the
>    ERROR keyword"
> => "Error" vs. "ERROR"; length 256 vs. 1024 characters
> 
> 
> Best,
> 
> Thomas
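
Until the two documentation pages agree, a cautious script could stick
to the stricter reading of both: print the keyword in upper case and
keep the message within the smaller limit. This is a minimal sketch
under that assumption; report_error, MAX_LEN and the /tmp check are
hypothetical illustrations, and whether pbs_mom also accepts "Error" or
longer messages is not confirmed here.

```shell
#!/bin/sh
# Sketch only: report a failure in a form that fits both documented variants.
MAX_LEN=256   # the smaller of the two documented message limits

report_error() {
    # Upper-case keyword, then the (single-line) message cut to MAX_LEN characters.
    msg=$(printf '%s' "$1" | cut -c1-"$MAX_LEN")
    printf 'ERROR %s\n' "$msg"
}

# Hypothetical check: /tmp must be writable on this node.
if [ ! -w /tmp ]; then
    report_error "/tmp is not writable on $(uname -n)"
fi
```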

