[torqueusers] "Copy request failed" message upon job completion
josh at clusterresources.com
Fri Sep 26 07:44:52 MDT 2008
See my comments below:
> I took a look at the server source code and the key section of code is
> in the server program req_jobobit.c. There is a call to a routine called
> issue_Drequest. If the return code from that function is zero, then
> the copy fails. Then, if the loglevel is >= 1, the error message for the
> failed copy is printed. The code in issue_Drequest is a bit "trickier"
> to follow. If I set the loglevel to zero, the error message is not
> written to the log. Obviously, I can make the message go away by setting server
> loglevel to zero. But, a problem hidden is not a problem solved (no
> matter how attractive the thought of doing so is).
You are right that in req_jobobit.c, when issue_Drequest returns 0 during the JOB_SUBSTATE_STAGEOUT step, TORQUE reports a failure. That report is incorrect, however: a return value of 0 from issue_Drequest means the function successfully sent the request to the MOM. We fixed this a few weeks ago. The fix is available in newer snapshots and will be part of the official TORQUE 2.3.5 release.
Most likely nothing is really failing. This is just a false error.
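
To illustrate the kind of inverted return-code check described above, here is a minimal sketch in C. It is not the actual TORQUE source. The names send_copy_request, copy_failed_buggy, copy_failed_fixed, and report_copy_result are hypothetical, and the only behavior assumed is what is stated in this thread: a return value of 0 means the request was sent, and the message is logged only when the loglevel is 1 or higher.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a dispatch routine such as issue_Drequest.
 * Assumption from the thread: 0 means the request was sent successfully;
 * here any nonzero value stands for a send failure. */
int send_copy_request(int simulate_send_error)
{
    return simulate_send_error ? -1 : 0;
}

/* Inverted check: treats the success value 0 as a failure. */
int copy_failed_buggy(int rc)
{
    return rc == 0;
}

/* Corrected check: only a nonzero return code counts as a failure. */
int copy_failed_fixed(int rc)
{
    return rc != 0;
}

/* Writes the message only when the chosen check reports a failure and
 * loglevel >= 1. Returns 1 if a message was written, 0 otherwise. */
int report_copy_result(int rc, int loglevel, int (*failed)(int))
{
    if (failed(rc) && loglevel >= 1) {
        fprintf(stderr, "Copy request failed\n");
        return 1;
    }
    return 0;
}
```

With the inverted check, report_copy_result(send_copy_request(0), 1, copy_failed_buggy) writes the message even though the request went out. Setting the loglevel to 0 only hides it. With the corrected check, the same call writes nothing.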