Bug 68 - Releasing of multi-process checkpointed job fails
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_server
Version: 2.4.x
Platform: PC Linux
Importance: P5 major
Assigned To: Al Taufer
 
Reported: 2010-07-05 08:53 MDT by Danny Sternkopf
Modified: 2011-01-24 11:33 MST (History)

Description Danny Sternkopf 2010-07-05 08:53:00 MDT
Setup:
o torque 2.4.8
o maui 3.3
o blcr 0.8.2


Please look at ./src/server/req_runjob.c line 1429:
if (strcmp(prun->rq_destin, exec_host) != 0)

For a job submitted with -lnodes=1:ppn=1, this comparison sees:

prun->rq_destin: htx5 vs. exec_host: htx5
-> which is okay. Hosts are the same, no failure.

But for a job submitted with -lnodes=1:ppn=2, the same comparison sees:

prun->rq_destin: htx5:ppn=2 vs. exec_host: htx5
-> which is not the same and gives the failure:
"...allocated nodes must match checkpoint location..."
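
To illustrate the mismatch, here is a minimal sketch that compares only the host-name part of each string, ignoring anything from the first ':' onward (such as ":ppn=2"). The helper same_host() is hypothetical and not part of TORQUE; it is not the change that was applied to req_runjob.c, and it assumes single-node strings like the ones shown above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of TORQUE.
 * Returns 1 if both strings name the same host, comparing only the
 * characters before the first ':' (or the whole string if there is
 * no ':'). Returns 0 otherwise.
 * Assumption: each argument describes a single node, e.g. "htx5" or
 * "htx5:ppn=2". */
static int same_host(const char *a, const char *b)
{
    size_t len_a = strcspn(a, ":"); /* length of host part of a */
    size_t len_b = strcspn(b, ":"); /* length of host part of b */

    return len_a == len_b && strncmp(a, b, len_a) == 0;
}
```

With the values from this report, strcmp("htx5:ppn=2", "htx5") is nonzero, while same_host("htx5:ppn=2", "htx5") returns 1. Multi-node requests would need additional parsing that this sketch does not attempt.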

Regards,

Danny
Comment 1 Al Taufer 2011-01-24 11:33:00 MST
Changes have been backported from 2.5.5 to fix this issue.