Bug 94 - error handling in checkpointing process
Status: NEW
Product: TORQUE
Version: 2.4.x
Platform: PC Linux
Importance: P5 normal
Assigned To: Glen
Reported: 2010-11-01 15:27 MDT by R
Modified: 2010-11-01 15:27 MDT (History)

Description R 2010-11-01 15:27:01 MDT
In 2.4.11, part of the checkpointing process is for pbs_mom to scp the
checkpoint files back to pbs_server.

I tested this with sshd stopped on pbs_server so that the scp fails. The job
is still moved to the 'H' state. When the job is later released with qrls, it
does not work, since the checkpointed image does not exist.

Ideally, if the scp fails, the job should not be moved to the 'H' state and
should remain running. More feedback to the client would be even better.
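
The requested behaviour could be sketched roughly as follows. This is only an illustration in Python, not TORQUE code: checkpoint_and_hold, hold_job and notify are hypothetical names, and the copy command is passed in as an argument list rather than taken from any real pbs_mom configuration. The point is just that the exit status of the copy is checked before the job is held, and that a failure is reported back.

```python
import subprocess
import sys


def checkpoint_and_hold(copy_cmd, hold_job, notify):
    """Run the copy command; hold the job only if the copy succeeded.

    copy_cmd : argument list for the transfer (for example an scp invocation)
    hold_job : hypothetical callable that moves the job to the 'H' state
    notify   : hypothetical callable that reports a message to the client
    Returns True if the job was held, False if it was left running.
    """
    try:
        result = subprocess.run(copy_cmd, capture_output=True, text=True)
    except OSError as exc:
        # The copy program itself could not be started.
        notify(f"checkpoint copy could not be started: {exc}")
        return False

    if result.returncode != 0:
        # Copy failed: leave the job running and tell the client why.
        notify(
            f"checkpoint copy failed (exit {result.returncode}): "
            f"{result.stderr.strip()}"
        )
        return False

    # Copy succeeded, so the checkpoint image exists and holding is safe.
    hold_job()
    return True
```

With this shape, a stopped sshd on the server side would make the copy exit non-zero, so the job would stay running and the client would see the error text instead of a silent hold.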