[torquedev] [Bug 94] New: error handling in checkpointing process
bugzilla-daemon at supercluster.org
Mon Nov 1 15:27:01 MDT 2010
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=94
Summary: error handling in checkpointing process
Product: TORQUE
Version: 2.4.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P5
Component: pbs_server
AssignedTo: glen.beane at gmail.com
ReportedBy: robinr at muohio.edu
CC: torquedev at supercluster.org
Estimated Hours: 0.0
In 2.4.11, part of the checkpointing process is for pbs_mom to scp the
checkpoint files back to the pbs_server.
I tested this with sshd stopped on pbs_server so that the scp fails: the job
still gets moved to the 'H' state. When the job is qrls'ed later, it does not
work because the checkpointed image does not exist.
Ideally, if the scp fails, the job should not be moved to the 'H' state and
should remain running. Feedback to the client about the failure would be even
better.
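
The requested behavior could be sketched roughly as follows. This is only an
illustration in Python, not TORQUE code: the function names, the job
dictionary, and the "comment" field are all hypothetical, and the real change
would have to be made in the C sources. The idea is simply to check the exit
status of the copy and to change the job state only when the copy succeeded.

```python
import subprocess
import sys


def copy_checkpoint(copy_cmd):
    """Run the copy command (e.g. an scp invocation); return (ok, message)."""
    try:
        result = subprocess.run(copy_cmd, capture_output=True, text=True)
    except OSError as exc:
        # The command could not be started at all (e.g. binary not found).
        return False, str(exc)
    if result.returncode != 0:
        detail = result.stderr.strip() or "exit status %d" % result.returncode
        return False, detail
    return True, ""


def checkpoint_and_hold(job, copy_cmd):
    """Hypothetical helper: hold the job only if the checkpoint copy worked."""
    ok, message = copy_checkpoint(copy_cmd)
    if not ok:
        # Leave the job state untouched and record why, so the client
        # can be told what went wrong.
        job["comment"] = "checkpoint copy failed: " + message
        return False
    job["state"] = "H"
    return True


# Stand-in for a failing scp, so the sketch runs without a real cluster.
failing_copy = [sys.executable, "-c", "import sys; sys.exit(1)"]
job = {"id": "94.example", "state": "R"}
checkpoint_and_hold(job, failing_copy)
print(job["state"], "-", job["comment"])
```

With a check of this kind, a failed copy leaves the job running and leaves a
message that could be passed back to the client.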
--
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.