[torquedev] Process migration with checkpoints (MSG=allocated nodes must match checkpoint location)
ERoman at lbl.gov
Wed Dec 1 18:36:45 MST 2010
Dear Torque Developers,
I'm taking a deeper look at the Torque server's checkpoint/restart code to
understand why qrls fails sometimes with the error message:
Reject reply code=15059(Cannot execute at specified host because of checkpoint or stagein files MSG=allocated nodes must match checkpoint location)
when used with BLCR.
What I've found is that the Torque server code, by default, does not seem to
support process migration. There's a flag that gets set for checkpointed
jobs, JOB_SVFLG_CHECKPOINT_FILE, that the server (in req_runjob.c) interprets
as meaning that the checkpoint file cannot migrate. The server tries to
reallocate the same hosts originally allocated to the job, using the exec_host
field in the job spec. If that allocation fails, the server rejects the
job request with the error message listed above.
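To make that decision path concrete, here is a minimal C sketch of the behavior as described above. It is not the code from req_runjob.c: struct job_sketch, hosts_free(), assign_hosts_sketch(), and the REJECT_* names are hypothetical, the flag's bit value is a placeholder, and node allocation is reduced to a string comparison. Only the flag name, the exec_host field, and reply code 15059 come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder bit value; not copied from the Torque headers. */
#define JOB_SVFLG_CHECKPOINT_FILE 0x10

/* Hypothetical names. 15059 is the reply code quoted in the error message. */
#define REJECT_CHECKPOINT_HOST 15059
#define REJECT_OTHER (-1)

/* Hypothetical, heavily simplified job record. */
struct job_sketch {
    int svrflags;           /* server flags, e.g. JOB_SVFLG_CHECKPOINT_FILE */
    const char *exec_host;  /* host list from the original run, or NULL */
};

/* Stand-in for real node allocation: succeeds only when the wanted
 * host list is exactly the list that is currently free. */
static int hosts_free(const char *wanted, const char *free_hosts)
{
    return strcmp(wanted, free_hosts) == 0;
}

/* Returns 0 and sets *assigned on success; otherwise sets *assigned
 * to NULL and returns a reject code. */
static int assign_hosts_sketch(const struct job_sketch *job,
                               const char *requested,
                               const char *free_hosts,
                               const char **assigned)
{
    int pinned;
    const char *wanted;

    assert(job != NULL && requested != NULL);
    assert(free_hosts != NULL && assigned != NULL);

    /* A checkpointed job is pinned to the hosts it ran on before. */
    pinned = (job->svrflags & JOB_SVFLG_CHECKPOINT_FILE) != 0
             && job->exec_host != NULL;
    wanted = pinned ? job->exec_host : requested;

    if (hosts_free(wanted, free_hosts)) {
        *assigned = wanted;
        return 0;
    }

    /* The original hosts could not be reallocated: no migration. */
    *assigned = NULL;
    return pinned ? REJECT_CHECKPOINT_HOST : REJECT_OTHER;
}
```

In this model a job checkpointed on "node01" is rejected with 15059 whenever "node01" is not free, even if the run request names a different host that is available.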
I think the JOB_SVFLG_CHECKPOINT_FILE flag dates back to the old Cray (C90 or
possibly T3E) days. Cray systems had restrictions on migrating checkpoints.
It looks like the server code was originally implemented without support for
checkpoint migration, since there's a separate flag
JOB_SVFLG_CHECKPOINT_MIGRATEABLE present in the server code. It's not clear to
me yet how this flag gets set, or whether the migration support is fully
implemented, but process migration is something that we want for integration
with Maui's preemption support.
I'm not sure what the right way to proceed here is. We could simply ignore the
JOB_SVFLG_CHECKPOINT_FILE flag, and assign nodes normally (from the run
request?). I think that's the easiest thing to implement, but it's not clear
whether there are other places in the code where it's assumed that the file won't
migrate. I'm going to go ahead and try testing this, but I'd love to know
if anyone else has thought about this. What do you think?
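For discussion, the "ignore the flag" option might look roughly like the C sketch below. It is only an illustration: choose_hosts_sketch() and its honor_flag parameter are hypothetical, the flag's bit value is a placeholder, and the sketch says nothing about other code that may assume the checkpoint file stays put.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder bit value; not copied from the Torque headers. */
#define JOB_SVFLG_CHECKPOINT_FILE 0x10

/* Returns the host list the server would try to allocate.
 *   honor_flag != 0: pin a checkpointed job to its old exec_host.
 *   honor_flag == 0: ignore the flag and use the run request. */
static const char *choose_hosts_sketch(int svrflags,
                                       const char *exec_host,
                                       const char *requested,
                                       int honor_flag)
{
    assert(requested != NULL);

    if (honor_flag
        && (svrflags & JOB_SVFLG_CHECKPOINT_FILE) != 0
        && exec_host != NULL) {
        return exec_host;  /* no migration: reuse the original hosts */
    }

    return requested;      /* migration allowed: assign nodes normally */
}
```

With honor_flag set to 0, a checkpointed job simply follows the hosts named in the run request, which is what a scheduler restarting a preempted job elsewhere would need.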