Bugzilla – Bug 94
error handling in checkpointing process
Last modified: 2010-11-01 15:27:01 MDT
In 2.4.11, part of the checkpointing process is for pbs_mom to scp the
checkpoint files back to the pbs_server.
I tested this with sshd on pbs_server stopped, so that the scp fails: the job
is still moved to the 'H' state. When the job is later released with qrls, the
release does not work, since the checkpoint image does not exist.
Ideally, if scp fails during this process, the job should not be moved to the
'H' state and should remain running. More feedback to the client would be even
better.
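
The requested behaviour could look roughly like the sketch below. This is not
TORQUE code: the Job class, copy_checkpoint() and hold_after_checkpoint() are
hypothetical names used only for illustration, and the state letters 'R' and
'H' simply mirror the ones mentioned in this report. The idea is to check the
exit status of scp and only place the hold when the copy succeeded.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical stand-in for a job record; not a TORQUE structure.
    job_id: str
    state: str = "R"  # 'R' = running, 'H' = held


def copy_checkpoint(src, dest, timeout=60):
    """Copy checkpoint files with scp and return (ok, message)."""
    try:
        # -B: batch mode (never prompt for a password), -r: copy recursively
        result = subprocess.run(
            ["scp", "-B", "-r", src, dest],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    if result.returncode != 0:
        detail = result.stderr.strip()
        return False, detail or f"scp exited with status {result.returncode}"
    return True, ""


def hold_after_checkpoint(job, src, dest, copy=copy_checkpoint):
    """Hold the job only if its checkpoint files were copied successfully."""
    ok, message = copy(src, dest)
    if not ok:
        # Leave the job running and report the reason to the caller.
        return (f"checkpoint copy failed for job {job.job_id}; "
                f"job left running: {message}")
    job.state = "H"
    return f"job {job.job_id} held"
```

The copy step is passed in as a parameter so the failure path can be exercised
without actually stopping sshd on the server.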