[torquedev] [Bug 94] New: error handling in checkpointing process

bugzilla-daemon at supercluster.org
Mon Nov 1 15:27:01 MDT 2010


           Summary: error handling in checkpointing process
           Product: TORQUE
           Version: 2.4.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: pbs_server
        AssignedTo: glen.beane at gmail.com
        ReportedBy: robinr at muohio.edu
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

In 2.4.11, part of the checkpointing process is for pbs_mom to scp the
checkpoint files back to the pbs_server.

In my test, with sshd stopped on the pbs_server so that the scp fails, the job
is still moved to the 'H' state. When the job is later released with qrls, it
does not work, because the checkpoint image does not exist.

Ideally, if the scp fails during this process, the job should not be moved to
the 'H' state and should remain running. More feedback to the client would be
even better.
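
To illustrate the requested behavior, here is a minimal sketch, not TORQUE
code. It assumes the copy step can be modeled as an external command whose exit
status is checked before the job state changes. The function names, the dict
standing in for a job, and the stand-in commands are all hypothetical.

```python
import subprocess
import sys


def copy_checkpoint(cmd):
    """Run the copy command (e.g. an scp invocation); return (ok, detail)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # The command could not be started at all.
        return False, str(exc)
    if result.returncode != 0:
        # A non-zero exit status means the copy failed.
        detail = result.stderr.strip() or f"exit status {result.returncode}"
        return False, detail
    return True, ""


def checkpoint_and_hold(job, copy_cmd):
    """Move the job to 'H' only if the checkpoint copy succeeded."""
    ok, detail = copy_checkpoint(copy_cmd)
    if not ok:
        # Leave the state untouched and record feedback for the client.
        job["comment"] = f"checkpoint copy failed: {detail}"
        return False
    job["state"] = "H"
    return True


# Stand-in commands (hypothetical) that simulate a failed and a working copy.
FAILING_COPY = [sys.executable, "-c", "import sys; sys.exit(1)"]
WORKING_COPY = [sys.executable, "-c", "pass"]
```

With the failing stand-in, the job keeps its original state and carries a
comment describing the failure. With the working stand-in, it moves to 'H'.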

