[torqueusers] "Copy request failed" message upon job completion

Michael Homa mhoma at uic.edu
Tue Sep 23 14:44:56 MDT 2008

I'm using the following two products: torque version 2.2.1 and maui
version maui-3.2.6p20. I'm currently testing my configuration for both.
A simple hello_world stub:

  #PBS -N hello_world
  #PBS -q dedicated
  #PBS -l nodes=argo17-1

executes properly and both the standard error and standard out files are created in
the test directory. Stderr is an empty file and stdout is the one line from
hello_world. However, the output from tracejob reveals the following (server
loglevel = 7):

  09/23/2008 12:11:30  S    JOB_SUBSTATE_EXITING
  09/23/2008 12:11:30  S    JOB_SUBSTATE_STAGEOUT
  09/23/2008 12:11:30  S    about to copy stdout/stderr/stageout files
  09/23/2008 12:11:30  S    copy request failed
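
For anyone trying to reproduce this, here is a minimal sketch of how the lines above can be picked out of a longer trace. The helper name filter_copy_lines and the job ID 1234 are made up for illustration. The qmgr and tracejob calls are left as comments because they have to be run on the Torque server host.

```shell
# Keep only the end-of-job substate and copy messages from
# tracejob-style output read on stdin.
# (filter_copy_lines is a hypothetical helper, not part of Torque.)
filter_copy_lines() {
  grep -E 'JOB_SUBSTATE_|copy'
}

# Assumed usage on the server host (1234 is a placeholder job ID):
#   qmgr -c 'set server log_level = 7'
#   tracejob 1234 | filter_copy_lines
```

The pattern is deliberately loose, so it may also match unrelated lines that happen to contain the word "copy".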

I'm not staging in/out any files and I do get the stdout/stderr files.
But the "JOB_SUBSTATE_STAGEOUT" line just before the "about to copy..." line
makes me think that something is going on with staging even though I
haven't specified any in the qsub statement.

Torque has been configured to use scp, and I tested the configuration and
authentication setup by doing a straightforward scp from the Linux
command line.
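
A minimal sketch of such an scp check follows. It assumes the copy is initiated on the compute node and goes back to the submission host, and that it runs as the job owner. The host name "headnode", the target path, and the helper scp_check_cmd are hypothetical; only argo17-1 comes from the job script above. The helper only prints the command so that it can be inspected before being run.

```shell
NODE="argo17-1"          # compute node named in the job script
SUBMIT_HOST="headnode"   # hypothetical submission host; adjust to the site

# Build (but do not run) a non-interactive scp check.
# BatchMode=yes makes scp fail instead of prompting for a password,
# which is closer to what a daemon without a terminal would see.
scp_check_cmd() {
  printf 'ssh %s scp -o BatchMode=yes /etc/hosts %s:/tmp/scp_check.%s\n' "$1" "$2" "$1"
}

scp_check_cmd "$NODE" "$SUBMIT_HOST"
```

If the printed command looks right, it can be pasted into a shell as the job owner. A non-zero exit status would point at the key or host setup rather than at Torque itself.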

There are no server or queue parameters that concern file copying and
staging. I was able to determine that I also get the message when I
try to stage in a file.

I took a look at the server source code and the key section of code is in
the server program req_jobobit.c. There is a call to a routine called
issue_Drequest. If the return code from that function is zero, then the
copy fails. Then, if the loglevel is >= 1, the error message for the
failed copy is printed. The code in issue_Drequest is a bit "trickier" to
follow. If I set the loglevel to zero, the error message is not written to
the log. Obviously, I can make the message go away by setting server
loglevel to zero. But, a problem hidden is not a problem solved (no matter
how attractive the thought of doing so is).

I'm unclear as to what is failing.  As I said, I get the std[out/err]
files. A search of the torque archives didn't reveal anything pertinent
(though I'm willing to concede that maybe I missed something). The
section in the wiki, "6.3 File Stage-In/Stage-Out", didn't have anything
applicable. Short of coding print statements into issue_Drequest to
trace the logic and identify the culprit, I'm out of ideas. Can anyone
suggest something I haven't thought of trying?

Michael Homa
Operating Systems Support and Database Group
Academic Computing and Communication Center
University of Illinois at Chicago
email:  mhoma at uic.edu
