[torqueusers] "Copy request failed" message upon job completion
mhoma at uic.edu
Fri Sep 26 08:38:06 MDT 2008
On Fri, 26 Sep 2008, Josh Butikofer wrote:
> See my comments below:
> > I took a look at the server source code and the key section of code is
> > in the server program req_jobobit.c. There is a call to a routine called
> > issue_Drequest. If the return code from that function is zero, the copy
> > is treated as failed. Then, if the loglevel is >= 1, the error message for
> > the failed copy is printed. The code in issue_Drequest is a bit "trickier"
> > to follow. If I set the loglevel to zero, the error message is not
> > written to the log. Obviously, I can make the message go away by setting server
> > loglevel to zero. But, a problem hidden is not a problem solved (no
> > matter how attractive the thought of doing so is).
> You are right that in req_jobobit.c if issue_Drequest returns a 0 when in the
> JOB_SUBSTATE_STAGEOUT step, TORQUE reports this as a failure. This is
> incorrect, however.
Thank you so much; this was driving me crazy:
issue_request.c case value ===> 54 (case PBS_Batch_CopyFiles)
issue_request.c 1. just entered PBS_BATCH_CopyFiles section
issue_request.c 2. return code from encode_DIS_ReqHdr --> 0
issue_request.c 3. return code from encode_DIS_CopyFiles --> 0
issue_request.c 4. return code from encode_DIS_ReqExtend --> 0
issue_request.c 5. return code from DIS_tcp_wflush) --> 0
issue_request.c 6. Leaving issue_Drequest rc -> 0
req_jobobit 2. 0 rc from issue_Drequest means failure
Many coders use a zero return code for "success". It just didn't seem
right to me that both std[out/err] were created and copied to the user's
home directory and yet issue_Drequest would return a failure code.
> When issue_Drequest returns 0 it means that the function successfully sent
> the request to the MOM. We fixed this a few weeks ago. This fix is available
> in newer snapshots and will be part of the official release for TORQUE 2.3.5.
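To make the mix-up concrete, here is a minimal C sketch of the two interpretations discussed above. It is not the TORQUE source: send_copy_request, copy_failed_old, copy_failed_new, and report_copy are hypothetical names invented for illustration, and the actual fix in the snapshots may look different. The only things taken from this thread are that a 0 return means the request was sent successfully, that the old check treated 0 as a failure, and that the message is only logged when the loglevel is >= 1.

```c
#include <stdio.h>

/* Hypothetical stand-in for issue_Drequest (not the real function):
 * returns 0 when the request was sent to the MOM, nonzero otherwise. */
static int send_copy_request(int simulate_send_error)
{
    return simulate_send_error ? -1 : 0;
}

/* The check as described in this thread: a 0 return is misread as failure. */
static int copy_failed_old(int rc)
{
    return rc == 0;
}

/* A check consistent with the "0 means sent" convention (assumed shape). */
static int copy_failed_new(int rc)
{
    return rc != 0;
}

/* Log the failure only when loglevel >= 1, as described for the server.
 * Returns 1 if a message was written, 0 otherwise. */
static int report_copy(int rc, int loglevel, int (*failed)(int))
{
    if (failed(rc) && loglevel >= 1) {
        fprintf(stderr, "Copy request failed (rc=%d)\n", rc);
        return 1;
    }
    return 0;
}
```

With the old check, a successful send (rc 0) at loglevel 1 produces the message, and dropping the loglevel to 0 only hides it. With the corrected check, the same successful send stays quiet at any loglevel, while a genuine send error is still reported.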
I'm going to ask a REALLY stupid question so I apologize. How does one
install a snapshot? Is there a document that explains the procedure?
> Most likely nothing is really failing. This is just a false error.
> --Josh B.