[torqueusers] "Copy request failed" message upon job completion

Michael Homa mhoma at uic.edu
Tue Sep 23 14:44:56 MDT 2008


I'm using the following two products: torque version 2.2.1 and maui
version maui-3.2.6p20. I'm currently testing my configuration for both.
A simple hello_world stub:

  #PBS -N hello_world
  #PBS -q dedicated
  #PBS -l nodes=argo17-1
  /home/homes51/mhoma/a.out

executes properly and both the standard error and standard out files are created in
the test directory. Stderr is an empty file and stdout is the one line from
hello_world. However, the output from tracejob reveals the following (server
loglevel = 7):

  09/23/2008 12:11:30  S    JOB_SUBSTATE_EXITING
  09/23/2008 12:11:30  S    JOB_SUBSTATE_STAGEOUT
  09/23/2008 12:11:30  S    about to copy stdout/stderr/stageout files
  09/23/2008 12:11:30  S    copy request failed

I'm not staging any files in or out, and I do get the stdout/stderr files.
But the "JOB_SUBSTATE_STAGEOUT" line just before the "about to copy..." line
makes me think that something is going on with staging even though I
haven't requested any in the qsub statement.
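
For anyone trying to reproduce this, a small filter like the sketch below
pulls just the substate transitions and copy messages out of tracejob
output. The function name is made up for illustration, and the job id in
the usage comment is a placeholder.

```shell
# Sketch: keep only substate transitions and copy-related messages
# from tracejob-style text read on stdin.
# Hypothetical usage (substitute a real job id):
#   tracejob <jobid> | stageout_lines
stageout_lines() {
  grep -E 'JOB_SUBSTATE_|copy'
}
```

Lines that mention neither a job substate nor a copy are dropped, which
makes the sequence leading up to "copy request failed" easier to see.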

Torque has been configured to use scp, and I tested the configuration and
authentication setup by doing a straightforward scp from the Linux
command line.
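
A non-interactive version of that check might look like the sketch below.
It only prints the scp command so it can be inspected first. The helper
name and the /tmp paths are placeholders, the host is the node from the
job script above, and using BatchMode to mimic a copy made without a
terminal is an assumption, not something taken from the Torque
documentation.

```shell
# Sketch: build a non-interactive scp test command.
# BatchMode=yes makes scp fail instead of prompting for a password,
# which is closer to a copy started by a daemon than an interactive test.
# Assumes paths and host names contain no spaces.
scp_check_cmd() {
  src=$1; host=$2; dest=$3
  printf 'scp -o BatchMode=yes %s %s:%s\n' "$src" "$host" "$dest"
}

# Print the command for inspection; pipe the output to sh to run it.
scp_check_cmd /tmp/scp_test.txt argo17-1 /tmp/scp_test.txt
```

If the printed command succeeds interactively but fails with BatchMode
set, the authentication setup is still relying on a prompt somewhere.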

There are no server or queue parameters that concern file copying and
staging. I was able to determine that I would get the message when I
tried to stage in a file.

I took a look at the server source code and the key section of code is in
the server program req_jobobit.c. There is a call to a routine called
issue_Drequest. If the return code from that function is zero, then the
copy fails. Then, if the loglevel is >= 1, the error message for the
failed copy is printed. The code in issue_Drequest is a bit "trickier" to
follow. If I set the loglevel to zero, the error message is not written to
the log. Obviously, I can make the message go away by setting server
loglevel to zero. But, a problem hidden is not a problem solved (no matter
how attractive the thought of doing so is).

I'm unclear as to what is failing.  As I said, I get the std[out/err]
files. A search of the torque archives didn't reveal anything pertinent
(though I'm willing to concede that maybe I missed something). The
section in the wiki, "6.3 File Stage-In/Stage-out", didn't have anything
applicable. Short of coding print statements into issue_Drequest to
trace the logic and identify the culprit, I'm out of ideas. Can anyone
suggest something I haven't thought of trying?

Michael Homa
Operating Systems Support and Database Group
Academic Computing and Communication Center
University of Illinois at Chicago
email:  mhoma at uic.edu


