[torquedev] Multi-process checkpointing
Danny Sternkopf
dsternkopf at hpce.nec.com
Tue Jun 29 07:36:43 MDT 2010
Hi,
still the same setup:
o torque 2.4.8
o maui 3.3
o blcr 0.8.2
It turned out that multi-process jobs can't be restarted after
checkpointing.
Please look at ./src/server/req_runjob.c line 1429:
if (strcmp(prun->rq_destin, exec_host) != 0)
For a job submitted with -lnodes=1:ppn=1, this comparison gives:
prun->rq_destin: htx5 vs. exec_host: htx5
-> which is okay. Hosts are the same, no failure.
But for a job submitted with -lnodes=1:ppn=2, the same comparison gives:
prun->rq_destin: htx5:ppn=2 vs. exec_host: htx5
-> which is not the same and results in a failure.
That's why qrls fails for all jobs that allocate more than one CPU. Or
is there anything one could set up in Torque or Maui to avoid this?
This issue also affects multi-host jobs.
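To illustrate the mismatch, here is a minimal sketch of a comparison that
ignores the ":ppn=N" suffix. same_host() is a hypothetical helper, not
existing Torque code and not a proposed patch. It assumes the two strings
look exactly like the values shown above and that only a single host is
involved.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of Torque): returns 1 if the host name
 * part of destin, i.e. everything before the first ':', equals
 * exec_host, and 0 otherwise. */
static int same_host(const char *destin, const char *exec_host)
{
    size_t host_len;

    assert(destin != NULL && exec_host != NULL);

    /* Length of destin up to the first ':' or to the end of the string. */
    host_len = strcspn(destin, ":");

    return strlen(exec_host) == host_len &&
           strncmp(destin, exec_host, host_len) == 0;
}
```

With this, same_host("htx5:ppn=2", "htx5") matches, whereas the plain
strcmp() does not. Multi-host jobs would need more than this, since each
host entry would have to be compared separately.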
Regards,
Danny