Bug 147 - nodes file not created on restart
Status: ASSIGNED
Product: TORQUE
Component: pbs_mom
Version: 2.5.x
Hardware: PC Linux
Importance: P5 normal
Assigned To: Al Taufer
Reported: 2011-07-12 18:18 MDT by Eric Roman
Modified: 2011-07-26 09:15 MDT

Attachments
Fix pbs_demux build when BLCR support is configured in. (466 bytes, patch)
2011-07-21 19:46 MDT, Eric Roman



Description Eric Roman 2011-07-12 18:18:07 MDT
When using qhold and qrls to run jobs, I've found that the nodes file is not
created by MOM.  This causes problems with process migration, since at restart
time there's no simple way to identify which nodes have been allocated to the
restarted job.  I think the fix here is that MOM should build a new nodes file
before the restart.
Comment 1 Ken Nielson 2011-07-19 14:07:32 MDT
(In reply to comment #0)
> When using qhold and qrls to run jobs, I've found that the nodes file is not
> created by MOM.  This causes problems with process migration, since at restart
> time there's no simple way to identify which nodes have been allocated to the
> restarted job.  I think the fix here is that MOM should build a new nodes file
> before the restart.

What do you mean by the nodes file? There is a nodes file for the server, so it
knows who all of the MOMs in the cluster are, but the MOMs do not maintain a
nodes file.
Comment 2 Al Taufer 2011-07-19 14:15:34 MDT
The mother superior will create a nodes file in TMomFinalizeChild() when it
starts the job, if the job has the MOM_HAS_NODEFILE flag set.
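
For reference, a minimal sketch of what that per-job node file looks like from
inside a job script; the resource request and spool path are just examples of a
default setup:

#PBS -l nodes=2:ppn=2
echo $PBS_NODEFILE    # e.g. /var/spool/torque/aux/<jobid>
cat $PBS_NODEFILE     # one hostname per allocated execution slot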
Comment 3 Ken Nielson 2011-07-19 14:19:58 MDT
(In reply to comment #2)
> The mother superior will create a nodes file in TMomFinalizeChild() when it
> starts the job, if the job has the MOM_HAS_NODEFILE flag set.

You mean the $PBS_NODEFILE. Thanks for the clarification.
Comment 4 Eric Roman 2011-07-19 15:22:18 MDT
> You mean the $PBS_NODESFILE. Thanks for the clarification.

Just for context, what I want to do here is roughly this.  At job startup,
I'm running (with Open MPI):

mpirun -am ft-enable-cr -hostfile $PBS_NODEFILE app

Later, when cr_checkpoint is called on the job script, we run ompi-checkpoint
to checkpoint the Open MPI app.  When the MPI checkpoint finishes, the job
script is checkpointed.  So far so good.

At restart time, we come back and want to run
ompi-restart -hostfile $PBS_NODEFILE <snapshot-id>
and this restarts the job.

The PBS_NODEFILE variable is taken from the checkpointed environment.
The environment array is saved with the checkpoint and restored at restart
time, so the name of the file, i.e. /var/spool/torque/aux/<jobid>, can't change.
We need that file created with the new node names in it, so we can run
ompi-restart with the updated node names.
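
To make the sequence concrete, here is a hedged sketch of that flow as a job
script fragment; only the commands already described above are assumed, and the
snapshot id is left as a placeholder:

# initial start, inside the job script:
mpirun -am ft-enable-cr -hostfile $PBS_NODEFILE app

# qhold calls cr_checkpoint on the job script; ompi-checkpoint is run first
# so the MPI ranks can drain outstanding messages before being checkpointed

# restart, possibly on a different set of nodes; $PBS_NODEFILE still names
# the same path, so MOM must rewrite that file with the new allocation:
ompi-restart -hostfile $PBS_NODEFILE <snapshot-id>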
Comment 5 Ken Nielson 2011-07-19 15:37:36 MDT
(In reply to comment #4)
No promises on this; however, it seems doable.  The .JB files for the job are
still around on the restart, and the exec_host list is in the .JB file.  The
problem is with the job state on the server.  If the restart is immediate
(before pbs_server has timed out the job), then the MOM could restart, pick up
the exec_host information, and continue.  However, if pbs_server has already
re-queued the job because it failed, or has deleted it, we do not want to allow
the job to restart.
Comment 6 Eric Roman 2011-07-19 16:08:42 MDT
Ok, I haven't worked with the MOM failure cases yet.  

The case I'm thinking of here is with C/R in normal system operation.  The
mom(s) remain running the whole time, but the job is checkpointed and later
restarted, likely on a different set of nodes.

My main concern right now is with restarting the job after qhold (holdjob) is
called to take a checkpoint and kill the job script.  Later, when the scheduler
decides to run the job again, it can assign a different set of nodes, and the
exec_host list changes in that case.

(In reply to comment #5)
> (In reply to comment #4)
> No promises on this; however, it seems doable.  The .JB files for the job are
> still around on the restart, and the exec_host list is in the .JB file.  The
> problem is with the job state on the server.  If the restart is immediate
> (before pbs_server has timed out the job), then the MOM could restart, pick up
> the exec_host information, and continue.  However, if pbs_server has already
> re-queued the job because it failed, or has deleted it, we do not want to allow the
> job to restart.
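
A short sketch of that scenario, with a hypothetical job id; the -c enabled
option assumes BLCR checkpointing is configured on the MOMs:

qsub -c enabled job.sh    # submit a checkpointable job
qhold 123.server          # takes a checkpoint and kills the job script
qrls 123.server           # the later run may land on different nodes, so
                          # exec_host (and the node file) must change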
Comment 7 Chris Samuel 2011-07-19 20:58:36 MDT
(In reply to comment #4)

> Just for context, what I want to do here is roughly this.  At job startup,
> I'm running (with Open MPI):
> 
> mpirun -am ft-enable-cr -hostfile $PBS_NODEFILE app

If you're using Open-MPI with Torque then there's no need to specify
a hostfile at all - as long as you've compiled it with support for the
TM API.

The TM API lets Open-MPI learn about where to run a job automatically
from the mother superior.  We use it all the time - it rocks. :-)

cheers,
Chris
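
For contrast, a minimal sketch of the non-C/R case Chris describes, assuming
Open MPI was built with TM support (e.g. configured with --with-tm):

mpirun app    # under TM, mpirun learns the allocated nodes and slot counts
              # from the mother superior, so no -hostfile or -np is needed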
Comment 8 Eric Roman 2011-07-20 11:55:20 MDT
I usually do this.  The problem is that when you add job checkpoint/restart to
the mix, things no longer work properly.

The issue is basically this.  Open MPI needs to save outstanding messages before
the checkpoint is taken.  To do this, you invoke ompi-checkpoint, and a control
message is sent out to the rank processes, which wait for the outstanding
messages to be sent and then call BLCR (or something else) to checkpoint.  If
you checkpoint the ranks directly with cr_checkpoint, which is what happens when
the ranks are launched through TM, the control messages aren't sent and the MPI
application hangs.  To work around that, we need to hide the rank processes from
Torque.  I was told by the Open MPI developers that this is very difficult to
fix.

Early on, when we were first working on BLCR support for Torque, we discussed
using checkpoints with TM and decided not to support it yet.  I would like to
see this fixed eventually, but given the effort that's involved, it'd be easier
to regenerate the node file.

> If you're using Open-MPI with Torque then there's no need to specify
> a hostfile at all - as long as you've compiled it with support for the
> TM API.
> 
> The TM API lets Open-MPI learn about where to run a job automatically
> from the mother superior.  We use it all the time - it rocks. :-)
> 
> cheers,
> Chris
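
A hedged sketch of the coordinated checkpoint path described above; the PID
lookup is illustrative only:

# ompi-checkpoint signals mpirun, which tells the ranks to drain in-flight
# messages; each rank then calls BLCR to checkpoint itself
ompi-checkpoint $(pgrep mpirun)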
Comment 9 Al Taufer 2011-07-20 12:03:37 MDT
Changes to recreate the $PBS_NODEFILE on checkpoint restart have been checked
into the 2.5, 3.0 and trunk branches and will be included in future snapshots
and releases.
Comment 10 Eric Roman 2011-07-21 19:44:42 MDT
(In reply to comment #9)
> Changes to recreate the $PBS_NODEFILE on checkpoint restart have been checked
> into the 2.5, 3.0 and trunk branches and will be included in future snapshots
> and releases.

Ok, I pulled the 2.5 fixes branch from subversion, and was able to migrate
a few test jobs successfully with the rebuilt node file.

I did run into a little issue building pbs_demux with BLCR support compiled in.
The pbs_demux_CPPFLAGS variable was commented out of src/resmom/Makefile.am, and
this broke the build.

I think it was commented out by mistake.  The line incorrectly defines the
per-target _CPPFLAGS by appending CPPFLAGS rather than AM_CPPFLAGS, i.e. it
should read:

pbs_demux_CPPFLAGS = $(BLCR_CPPFLAGS) $(AM_CPPFLAGS)

to inherit the AM_CPPFLAGS that are defined earlier in the Makefile.am.

I'll attach a patch that fixes this.
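
For completeness, a hedged sketch of how to reproduce and verify the build
issue; the configure option shown is an assumption about a typical BLCR-enabled
build:

autoreconf -fi              # regenerate after editing src/resmom/Makefile.am
./configure --enable-blcr   # assumed flag for building with BLCR support
make                        # pbs_demux fails to compile without the fix above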
Comment 11 Eric Roman 2011-07-21 19:46:11 MDT
Created an attachment (id=85)
Fix pbs_demux build when BLCR support is configured in.
Comment 12 Al Taufer 2011-07-26 09:15:31 MDT
I have checked in Eric's patch to fix the pbs_demux build issue.  It is now in
2.5, 3.0 and trunk.