[torquedev] BLCR with Torque
Michel Béland
michel.beland at rqchp.qc.ca
Tue Dec 15 14:06:52 MST 2009
Hello,
We are trying to use BLCR with Torque 2.4.3 to checkpoint MPI jobs here,
and we are running into some problems. We started by configuring Torque
to checkpoint sequential jobs. This works pretty well now. The
checkpointing script runs cr_checkpoint with the process id of the shell
that is the parent of the PBS script.
When we request more than one node, Torque starts a pbs_demux process
below the shell that we try to checkpoint. When cr_checkpoint runs, it
fails. If we kill pbs_demux and try again, we can get a checkpoint
(though it may be unusable).
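
To see exactly what sits below the shell at checkpoint time, something
like the following sketch can be used. It is only a diagnostic, not part
of Torque or BLCR: the helper names read_proc_table and descendants are
made up for illustration, and it assumes a Linux /proc layout. It lists
every descendant of a given process id, so one can check whether
pbs_demux is among them before cr_checkpoint is run.

```python
import os
import sys


def read_proc_table(proc_root="/proc"):
    """Return {pid: (ppid, name)} read from Linux /proc/<pid>/stat files."""
    table = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"),
                      errors="replace") as f:
                data = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Layout is "pid (comm) state ppid ..."; comm may itself contain
        # spaces or parentheses, so split around the LAST ')'.
        lpar = data.find("(")
        rpar = data.rfind(")")
        name = data[lpar + 1:rpar]
        fields = data[rpar + 2:].split()  # fields[0]=state, fields[1]=ppid
        table[int(entry)] = (int(fields[1]), name)
    return table


def descendants(table, root_pid):
    """Return [(pid, name), ...] for every process below root_pid."""
    children = {}
    for pid, (ppid, _name) in table.items():
        children.setdefault(ppid, []).append(pid)
    found = []
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        for child in sorted(children.get(pid, [])):
            found.append((child, table[child][1]))
            stack.append(child)
    return found


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: pass the pid of the shell being checkpointed.
    for pid, name in descendants(read_proc_table(), int(sys.argv[1])):
        flag = "  <-- pbs_demux" if name == "pbs_demux" else ""
        print(pid, name + flag)
```

Run with the shell's process id as the only argument, this prints one
line per descendant and marks any pbs_demux it finds.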
Does anybody know why pbs_demux interferes with the checkpointing
procedure and what we can do to make it work? After solving this problem,
we still have to checkpoint the MPI program itself, but that is another
story...
--
Michel Béland, analyste en calcul scientifique
michel.beland at rqchp.qc.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone : 514 343-6111 poste 3892 télécopieur : 514 343-2155
RQCHP (Réseau québécois de calcul de haute performance) www.rqchp.qc.ca