[torqueusers] pbs_mom: check_ms, MS reset from X to Y

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Thu Mar 1 09:48:57 MST 2007


Dear All,

From time to time we find entries like the following in our
mom_logs (torque-2.1.6):

  <hostname> pbs_mom: check_ms, MS reset from X to Y (somehost:1023)

where X and Y are numbers in the range 1..50 (with Y being
larger than X) and "somehost" is another compute node (probably
the one holding the master mom of the job).

What is the meaning and reason for these messages?

As it occurs only very rarely, I have no idea how to investigate
the issue further. In my experience so far, whenever this message
occurs, Pete Wyckoff's mpiexec (0.82) fails to start some of the
128 MPI processes it should start on 32 nodes. If you try again
immediately after the error (in the same interactive PBS job),
everything runs fine. (I guess it is the pbs_mom that causes the
mpiexec to fail, not the other way round.)
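
To correlate these entries with failed mpiexec runs, one could
collect them from the mom_logs with something like the sketch
below. The regular expression is only an assumption derived from
the single line quoted above; the function name find_ms_resets and
the sample lines are hypothetical, and real log lines may carry
additional fields.

```python
import re

# Assumed pattern, based only on the log line quoted in this mail:
#   <hostname> pbs_mom: check_ms, MS reset from X to Y (somehost:1023)
MS_RESET = re.compile(
    r"check_ms, MS reset from (?P<old>\d+) to (?P<new>\d+) "
    r"\((?P<host>[^:()\s]+):(?P<port>\d+)\)"
)


def find_ms_resets(lines):
    """Return (old, new, host, port) for every line that matches."""
    hits = []
    for line in lines:
        m = MS_RESET.search(line)
        if m:
            hits.append(
                (int(m.group("old")), int(m.group("new")),
                 m.group("host"), int(m.group("port")))
            )
    return hits


# Hypothetical sample lines, not taken from a real log.
sample = [
    "node01 pbs_mom: check_ms, MS reset from 3 to 7 (node17:1023)",
    "node01 pbs_mom: some unrelated message",
]
resets = find_ms_resets(sample)
print(resets)
```

Since the function accepts any iterable of lines, an open log file
object can be passed to it directly.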


Regards,

thomas
-- 
Dipl.-Ing. Thomas ZEISER
Regionales Rechenzentrum Erlangen
Martensstr. 1, 91058 Erlangen, GERMANY

