[torquedev] BLCR with Torque

Michel Béland michel.beland at rqchp.qc.ca
Tue Dec 15 14:06:52 MST 2009


Hello,

We are trying to use BLCR with Torque 2.4.3 and MPI jobs here. We are 
facing some problems. We started by configuring Torque to be able to 
checkpoint sequential jobs. This works pretty well now. The 
checkpointing script runs cr_checkpoint with the process id of the shell 
above the PBS script.
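
For illustration only, the call made by such a checkpointing script might look roughly like the Python sketch below. The function names and the context file path are hypothetical. The `--tree` and `-f` options are assumed from the cr_checkpoint man page and should be checked against the installed BLCR version. How Torque hands the process id to the script is not shown.

```python
import subprocess


def build_checkpoint_command(shell_pid, context_file):
    """Return the argv list that checkpoints the process tree rooted at shell_pid."""
    if shell_pid <= 0:
        raise ValueError("shell_pid must be a positive process id")
    # --tree: checkpoint the process and its descendants (assumed option).
    # -f: write the checkpoint context to the given file (assumed option).
    return ["cr_checkpoint", "--tree", "-f", context_file, str(shell_pid)]


def run_checkpoint(shell_pid, context_file):
    """Run cr_checkpoint and return its exit status (requires BLCR to be installed)."""
    command = build_checkpoint_command(shell_pid, context_file)
    return subprocess.run(command).returncode
```

A nonzero return value from run_checkpoint would correspond to the failure described in the next paragraph.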

When we request more than one node, Torque starts a pbs_demux process 
below the shell that we try to checkpoint. When cr_checkpoint runs, it 
fails. If we kill pbs_demux and try again, we can get a checkpoint 
(though it may be unusable).
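
To confirm which processes sit below the shell at checkpoint time, a small helper along these lines could be used. It is only a sketch: the function names are hypothetical, and it assumes a `ps` that accepts the POSIX `-e -o pid=,ppid=,comm=` options.

```python
import subprocess


def descendants(root_pid, processes):
    """Return (pid, name) pairs for every process below root_pid.

    processes is an iterable of (pid, ppid, name) tuples.
    """
    children = {}
    for pid, ppid, name in processes:
        children.setdefault(ppid, []).append((pid, name))
    found = []
    seen = {root_pid}
    stack = [root_pid]
    while stack:
        parent = stack.pop()
        for pid, name in children.get(parent, []):
            if pid not in seen:  # guard against self-parented entries
                seen.add(pid)
                found.append((pid, name))
                stack.append(pid)
    return found


def read_process_table():
    """Read (pid, ppid, name) tuples from ps."""
    output = subprocess.run(
        ["ps", "-e", "-o", "pid=,ppid=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    table = []
    for line in output.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3:
            table.append((int(fields[0]), int(fields[1]), fields[2]))
    return table
```

Calling descendants(shell_pid, read_process_table()) on the first node of a multi-node job should list pbs_demux if it is indeed a child of the shell.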

Does anybody know why pbs_demux interferes with the checkpointing procedure 
and what we can do to make it work? Once this problem is solved, we will 
still have to checkpoint the MPI program itself, but that is another story...


-- 
Michel Béland, scientific computing analyst
michel.beland at rqchp.qc.ca
office S-250, Roger-Gaudry building (main), Université de Montréal
telephone: 514 343-6111 ext. 3892     fax: 514 343-2155
RQCHP (Réseau québécois de calcul de haute performance)  www.rqchp.qc.ca

