[torqueusers] "Copy request failed" message upon job completion
David.Singleton at anu.edu.au
Tue Sep 23 14:52:15 MDT 2008
The copying of stdout and stderr is also considered a stageout - the job
will be in substate JOB_SUBSTATE_STAGEOUT while doing those copies.
Can you look at the MOM logs on the relevant compute node?
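One way to pull the relevant entries out of a MOM log is sketched below. It assumes the usual semicolon-separated TORQUE log layout (timestamp;event code;daemon;class;object name;message) and that the logs live somewhere like /var/spool/torque/mom_logs, which depends on how TORQUE was installed. The function name, the job IDs, and the sample records are made up for illustration.

```python
from typing import Iterable, List, Tuple


def job_entries(lines: Iterable[str], job_id: str) -> List[Tuple[str, str]]:
    """Return (timestamp, message) pairs for log records that name job_id.

    Assumes each record has the layout
    timestamp;event_code;daemon;class;object_name;message
    """
    found = []
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not a record in the assumed layout
        timestamp, _code, _daemon, _cls, name, message = fields
        if name.strip() == job_id:
            found.append((timestamp, message))
    return found


# Hypothetical sample records; real MOM log entries will differ.
sample = [
    "09/23/2008 12:11:30;0008;   pbs_mom;Job;42.argo;example message one\n",
    "09/23/2008 12:11:30;0008;   pbs_mom;Job;43.argo;other job\n",
    "not a log record\n",
]
for timestamp, message in job_entries(sample, "42.argo"):
    print(timestamp, message)
```

To run it against a real log, pass an open file instead of the sample list, with the full job ID as shown by qstat.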
Michael Homa wrote:
> I'm using the following two products: torque version 2.2.1 and maui
> version maui-3.2.6p20. I'm currently testing my configuration for both.
> A simple hello_world stub:
> #PBS -N hello_world
> #PBS -q dedicated
> #PBS -l nodes=argo17-1
> executes properly and both the standard error and standard out files are created in
> the test directory. Stderr is an empty file and stdout is the one line from
> hello_world. However, the output from tracejob reveals the following (server
> loglevel = 7):
> 09/23/2008 12:11:30 S JOB_SUBSTATE_EXITING
> 09/23/2008 12:11:30 S JOB_SUBSTATE_STAGEOUT
> 09/23/2008 12:11:30 S about to copy stdout/stderr/stageout files
> 09/23/2008 12:11:30 S copy request failed
> I'm not staging in/out any files and I do get the stdout/stderr files.
> But the "JOB_SUBSTATE_STAGEOUT" line just before the "about to copy..." line
> makes me think that something is going on with staging even though I
> haven't specified any in the qsub statement.
> Torque has been configured to use scp, and I tested the configuration and
> authentication setup by doing a straightforward scp from the Linux
> command line.
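An interactive scp can succeed where an unattended one fails, for instance when a password or host-key prompt appears. A closer imitation of an unattended copy may be to run scp in batch mode from the compute node, as the job owner. The sketch below only builds and runs such a command. The helper names, file path, and host name are placeholders, and it is an assumption that this mirrors the copy TORQUE attempts.

```python
import subprocess
from typing import List


def batch_scp_command(source: str, destination: str) -> List[str]:
    """Build an scp command that fails instead of prompting (-B is batch mode)."""
    return ["scp", "-B", source, destination]


def try_copy(source: str, destination: str) -> int:
    """Run the copy and return scp's exit status (0 means success)."""
    result = subprocess.run(batch_scp_command(source, destination))
    return result.returncode


# Placeholder file and host: replace with a real file and the submit host.
print(" ".join(batch_scp_command("/tmp/testfile", "submithost:/tmp/testfile")))
```

A non-zero status from try_copy in batch mode, where the interactive copy worked, would point at authentication or host-key setup rather than at TORQUE itself.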
> There are no server or queue parameters that concern file copying and
> staging. I was able to determine that I would get the message when I
> tried to stagein a file.
> I took a look at the server source code and the key section of code is in
> the server program req_jobobit.c. There is a call to a routine called
> issue_Drequest. If the return code from that function is zero, then the
> copy fails. Then, if the loglevel is >= 1, the error message for the
> failed copy is printed. The code in issue_Drequest is a bit "trickier" to
> follow. If I set the loglevel to zero, the error message is not written to
> the log. Obviously, I can make the message go away by setting server
> loglevel to zero. But, a problem hidden is not a problem solved (no matter
> how attractive the thought of doing so is).
> I'm unclear as to what is failing. As I said, I get the std[out/err]
> files. A search of the torque archives didn't reveal anything pertinent
> (though I'm willing to concede that maybe I missed something). The
> section in the wiki, "6.3 File Stage-In/Stage-out", didn't have anything
> applicable. Short of coding print statements into issue_Drequest to
> trace the logic and identify the culprit, I'm out of ideas. Can anyone
> suggest something I haven't thought of trying?
> Michael Homa
> Operating Systems Support and Database Group
> Academic Computing and Communication Center
> University of Illinois at Chicago
> email: mhoma at uic.edu