[torqueusers] "Copy request failed" message upon job completion
mhoma at uic.edu
Tue Sep 23 14:44:56 MDT 2008
I'm using the following two products: torque version 2.2.1 and maui
version maui-3.2.6p20. I'm currently testing my configuration for both.
A simple hello_world stub:
#PBS -N hello_world
#PBS -q dedicated
#PBS -l nodes=argo17-1
executes properly and both the standard error and standard out files are created in
the test directory. Stderr is an empty file and stdout is the one line from
hello_world. However, the output from tracejob reveals the following (server
loglevel = 7):
09/23/2008 12:11:30 S JOB_SUBSTATE_EXITING
09/23/2008 12:11:30 S JOB_SUBSTATE_STAGEOUT
09/23/2008 12:11:30 S about to copy stdout/stderr/stageout files
09/23/2008 12:11:30 S copy request failed
I'm not staging in/out any files and I do get the stdout/stderr files.
But the "JOB_SUBSTATE_STAGEOUT" line just before the "about to copy..." line
makes me think that something is going on with staging even though I
haven't specified any staging in the qsub statement.
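
One way to isolate the relevant part of a longer trace is to filter it for the server-side lines that mention staging or copying. This is only a minimal sketch: the helper name stageout_lines is made up for illustration, the sample input is the excerpt quoted above, and real tracejob output may carry extra columns that need a looser pattern.

```shell
# Hypothetical helper (not part of Torque): read a trace on stdin and keep
# only the server ("S") lines that mention STAGEOUT or a copy.
stageout_lines() {
  grep -E ' S .*(STAGEOUT|copy)'
}

# Sample input: the four lines quoted from the trace above.
sample='09/23/2008 12:11:30 S JOB_SUBSTATE_EXITING
09/23/2008 12:11:30 S JOB_SUBSTATE_STAGEOUT
09/23/2008 12:11:30 S about to copy stdout/stderr/stageout files
09/23/2008 12:11:30 S copy request failed'

printf '%s\n' "$sample" | stageout_lines
```

With a real job, the same filter could presumably be fed directly from tracejob, piping its output for the job ID in question into stageout_lines.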
Torque has been configured to use scp, and I tested the configuration and
authentication setup by doing a straightforward scp from the Linux command line.
There are no server or queue parameters that concern file copying and
staging. I was able to determine that I would get the message when I
tried to stagein a file.
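
For reference, an scp check of the kind described above can be made closer to what a batch daemon experiences by running it in batch mode, so that any password or passphrase prompt turns into an immediate failure instead of an interactive question. The sketch below is an assumption about how such a test might look, not a description of what Torque runs internally; the helper name batch_scp_check, the DRY_RUN switch, and the file and host names are all placeholders.

```shell
# Hypothetical helper, not part of Torque. With DRY_RUN=1 (the default here)
# it only prints the command; set DRY_RUN=0 to actually attempt the copy.
batch_scp_check() {
  src=$1
  dest=$2
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "scp -B $src $dest"
  else
    # -B selects scp batch mode (no password or passphrase prompts);
    # stdin is redirected so nothing can be typed in, as in a daemon context.
    scp -B "$src" "$dest" </dev/null
  fi
}

# Placeholder source file and destination; substitute real values.
batch_scp_check /tmp/testfile user@submit-host:/tmp/
```

Running the real copy (DRY_RUN=0) as the job owner, from the compute node toward the host where the output files are expected, would be the closest match to the job's situation.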
I took a look at the server source code, and the key section of code is in
the server source file req_jobobit.c. There is a call to a routine called
issue_Drequest. If the return code from that function is zero, then the
copy fails. Then, if the loglevel is >= 1, the error message for the
failed copy is printed. The code in issue_Drequest is a bit "trickier" to
follow. If I set the loglevel to zero, the error message is not written to
the log. Obviously, I can make the message go away by setting server
loglevel to zero. But, a problem hidden is not a problem solved (no matter
how attractive the thought of doing so is).
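
The point about the log level can be illustrated with a toy model. This is not the Torque source, only a sketch of the behaviour as described above: the outcome of the copy request is decided first, and the log level merely controls whether the failure is reported. The function name report_copy and its arguments are invented for the illustration.

```shell
# Hypothetical model, not Torque code.
# $1 = log level, $2 = 1 if the copy request succeeded, 0 if it failed.
report_copy() {
  loglevel=$1
  copy_ok=$2
  if [ "$copy_ok" -eq 0 ] && [ "$loglevel" -ge 1 ]; then
    echo "copy request failed"
  fi
  # The exit status reflects the outcome regardless of the log level.
  [ "$copy_ok" -eq 1 ]
}

report_copy 7 0 || echo "non-zero status, and the message was logged"
report_copy 0 0 || echo "still non-zero status, but nothing was logged"
```

In both calls the simulated request fails; only the visibility of the failure changes.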
I'm unclear as to what is failing. As I said, I get the std[out/err]
files. A search of the torque archives didn't reveal anything pertinent
(though I'm willing to concede that maybe I missed something). The
section in the wiki, "6.3 File Stage-In/Stage-out", didn't have anything
applicable. Short of coding print statements into issue_Drequest to
trace the logic and identify the culprit, I'm out of ideas. Can anyone
suggest something I haven't thought of trying?
Operating Systems Support and Database Group
Academic Computing and Communication Center
University of Illinois at Chicago
email: mhoma at uic.edu