[torquedev] Multi-process checkpointing

Danny Sternkopf dsternkopf at hpce.nec.com
Tue Jun 29 07:36:43 MDT 2010


Hi,

still the same setup:
o torque 2.4.8
o maui 3.3
o blcr 0.8.2

It turned out that multi-process jobs can't be restarted after 
checkpointing.

Please look at ./src/server/req_runjob.c line 1429:
if (strcmp(prun->rq_destin, exec_host) != 0)

For a job submitted with -lnodes=1:ppn=1, this comparison gives:

prun->rq_destin: htx5 vs. exec_host: htx5
-> which is okay. Hosts are the same, no failure.

But for a job submitted with -lnodes=1:ppn=2, the same comparison gives:

prun->rq_destin: htx5:ppn=2 vs. exec_host: htx5
-> which is not the same and gives a failure.
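
To illustrate the mismatch, here is a minimal sketch of a comparison that 
looks only at the host name and ignores a trailing ":ppn=N". The names 
host_len and same_host are hypothetical helpers and do not exist in Torque. 
The sketch assumes both strings have exactly the form shown above (a single 
host, optionally followed by ":..."), so it does not cover multi-host 
requests and is not a proposed patch for req_runjob.c.

```c
#include <string.h>

/* Hypothetical helper, not part of Torque: length of the host name
 * portion of s, i.e. everything before the first ':' (or all of s
 * if it contains no ':'). */
size_t host_len(const char *s)
{
    return strcspn(s, ":");
}

/* Hypothetical helper: returns 1 if both strings name the same host
 * once any ":ppn=N" style suffix is ignored, 0 otherwise.
 * Assumes single-host strings such as "htx5" or "htx5:ppn=2". */
int same_host(const char *destin, const char *exec_host)
{
    size_t a = host_len(destin);
    size_t b = host_len(exec_host);

    /* Host parts must have equal length and equal bytes. */
    return a == b && strncmp(destin, exec_host, a) == 0;
}
```

With such a check, "htx5:ppn=2" and "htx5" would count as the same host, 
while "htx5" and "htx55" would still differ.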

That's why qrls fails for all jobs that allocated more than one CPU. Or 
is there anything one could set up in Torque or Maui to avoid this?

This issue also affects multi-host jobs.

Regards,

Danny

