[torquedev] [Bug 68] New: Releasing of multi-process checkpointed job fails

bugzilla-daemon at supercluster.org
Mon Jul 5 08:53:01 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=68

           Summary: Releasing of multi-process checkpointed job fails
           Product: TORQUE
           Version: 2.4.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: pbs_server
        AssignedTo: glen.beane at gmail.com
        ReportedBy: dsternkopf at hpce.nec.com
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


Setup:
o torque 2.4.8
o maui 3.3
o blcr 0.8.2


Please look at ./src/server/req_runjob.c line 1429:
if (strcmp(prun->rq_destin, exec_host) != 0)

For a job submitted with -lnodes=1:ppn=1, this comparison sees:

prun->rq_destin: htx5 vs. exec_host: htx5
-> which is okay. Hosts are the same, no failure.

But for a job submitted with -lnodes=1:ppn=2, the same comparison sees:

prun->rq_destin: htx5:ppn=2 vs. exec_host: htx5
-> which are not the same strings, so the release fails with:
"...allocated nodes must match checkpoint location..."
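
The mismatch can be reproduced outside the server. What follows is only a sketch of one possible approach, not a tested patch against req_runjob.c: it compares just the host-name portion of the two strings. The helper names host_part_len() and same_host() are hypothetical and do not exist in TORQUE. Treating ':' and '/' as the end of the host name is an assumption, and only the first host of each string is considered, so multi-node jobs are not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the bare host name, i.e. everything
 * before the first ':' or '/' (assumed suffix delimiters, e.g. ":ppn=2"). */
size_t host_part_len(const char *spec)
{
    return strcspn(spec, ":/");
}

/* Hypothetical helper: returns 1 if both strings name the same host once
 * any suffix is ignored, 0 otherwise. Only the first host is compared. */
int same_host(const char *rq_destin, const char *exec_host)
{
    size_t len_destin;
    size_t len_exec;

    assert(rq_destin != NULL && exec_host != NULL);

    len_destin = host_part_len(rq_destin);
    len_exec = host_part_len(exec_host);

    /* The lengths must match too, so that "htx5" does not match "htx50". */
    return len_destin == len_exec &&
           strncmp(rq_destin, exec_host, len_destin) == 0;
}
```

With the values from this report, strcmp("htx5:ppn=2", "htx5") is non-zero, which triggers the failure, while same_host("htx5:ppn=2", "htx5") returns 1.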

Regards,

Danny

