[torqueusers] "Copy request failed" message upon job completion

Michael Homa mhoma at uic.edu
Tue Sep 23 16:12:26 MDT 2008


>
> The copying of stdout and stderr is also considered a stageout - the job
> will be in substate JOB_SUBSTATE_STAGEOUT while doing those copies.
>
> Can you look at the MOM logs on the relevant compute node?

Yes, but first, I increased the mom loglevel to 7 so lots of verbosity to
go through (sorry). I highlighted what I consider to be the key section.

09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;committing job
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;starting job execution
09/23/2008 16:53:49;0001;   pbs_mom;Job;job_nodes;0: argo17-1/0
09/23/2008 16:53:49;0001;   pbs_mom;Job;job_nodes;job: 379.argo.cc.uic.edu numnodes=1 numvnod=1
09/23/2008 16:53:49;0001;   pbs_mom;Svr;pbs_mom;Success (0) in init_groups, pre-sigprocmask
09/23/2008 16:53:49;0001;   pbs_mom;Svr;pbs_mom;Success (0) in init_groups, post-initgroups
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;evaluating limits for job
09/23/2008 16:53:49;0001;   pbs_mom;Job;379.argo.cc.uic.edu;about to fork child which will become job
09/23/2008 16:53:49;0001;   pbs_mom;Job;TMomFinalizeJob2;job: 379.argo.cc.uic.edu numnodes=1 numvnod=1
09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_close_poll;entered
09/23/2008 16:53:49;0001;   pbs_mom;Job;379.argo.cc.uic.edu;phase 2 of job launch successfully completed
09/23/2008 16:53:49;0001;   pbs_mom;Job;379.argo.cc.uic.edu;task/session info loaded
09/23/2008 16:53:49;0001;   pbs_mom;Job;TMomFinalizeJob3;read start return code=0 session=12540
09/23/2008 16:53:49;0001;   pbs_mom;Job;379.argo.cc.uic.edu;saving task (TMomFinalizeJob3)
09/23/2008 16:53:49;0008;   pbs_mom;Svr;task_save;saving task in /var/spool/torque/mom_priv/jobs/379.argo.cc.TK/0000000001
09/23/2008 16:53:49;0001;   pbs_mom;Job;TMomFinalizeJob3;job 379.argo.cc.uic.edu started, pid = 12540
09/23/2008 16:53:49;0001;   pbs_mom;Job;379.argo.cc.uic.edu;job successfully started
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;job 379.argo.cc.uic.edu reported successful start on 1 node(s)
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;job execution started
09/23/2008 16:53:49;0080;   pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
09/23/2008 16:53:49;0080;   pbs_mom;Req;dis_request_read;decoding command StatusJob from PBS_Server
09/23/2008 16:53:49;0008;   pbs_mom;Job;process_request;request type StatusJob from host argo.cc.uic.edu received
09/23/2008 16:53:49;0008;   pbs_mom;Job;process_request;request type StatusJob from host argo.cc.uic.edu allowed
09/23/2008 16:53:49;0008;   pbs_mom;Job;dispatch_request;dispatching request StatusJob on sd=10
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;sending "send flagged" attr: session_id
09/23/2008 16:53:49;0080;   pbs_mom;Req;dis_request_read;decoding command ModifyJob from PBS_Server
09/23/2008 16:53:49;0008;   pbs_mom;Job;process_request;request type ModifyJob from host argo.cc.uic.edu received
09/23/2008 16:53:49;0008;   pbs_mom;Job;process_request;request type ModifyJob from host argo.cc.uic.edu allowed
09/23/2008 16:53:49;0008;   pbs_mom;Job;dispatch_request;dispatching request ModifyJob on sd=13
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;modifying job
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;modifying attribute resource of job (value: '???')
09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_set_limits;mom_set_limits(379.argo.cc.uic.edu,alter) entered
09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_set_limits;setting limit for attribute 'ncpus'
09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_set_limits;setting limit for attribute 'neednodes'
09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_set_limits;setting limit for attribute 'nodes'
09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_set_limits;setting limit for attribute 'walltime'
09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_set_limits;mom_set_limits(379.argo.cc.uic.edu,alter) completed
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;Job Modified at request of PBS_Server at argo.cc.uic.edu
09/23/2008 16:53:49;0080;   pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
09/23/2008 16:53:49;0080;   pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
09/23/2008 16:53:49;0080;   pbs_mom;Svr;mom_get_sample;proc_array load started
09/23/2008 16:53:49;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=72
09/23/2008 16:53:49;0080;   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 379.argo.cc.uic.edu
09/23/2008 16:53:49;0080;   pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 379.argo.cc.uic.edu
09/23/2008 16:53:49;0080;   pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 379.argo.cc.uic.edu
09/23/2008 16:53:49;0008;   pbs_mom;Job;scan_for_terminated;for job 379.argo.cc.uic.edu, task 1, pid=12540, exitcode=0
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;sending signal 9 to task
09/23/2008 16:53:49;0008;   pbs_mom;Svr;task_save;saving task in /var/spool/torque/mom_priv/jobs/379.argo.cc.TK/0000000001
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;scan_for_terminated: job 379.argo.cc.uic.edu task 1 terminated, sid=12540
09/23/2008 16:53:49;0080;   pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;job was terminated
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;master task has exited - sending kill job request to all sisters
09/23/2008 16:53:49;0008;   pbs_mom;Req;send_sisters;sending command KILL_JOB for job 379.argo.cc.uic.edu (2)
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;task is dead
09/23/2008 16:53:49;0008;   pbs_mom;Svr;task_save;saving task in /var/spool/torque/mom_priv/jobs/379.argo.cc.TK/0000000001
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;local task termination detected.  killing job
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;kill_job
09/23/2008 16:53:49;0002;   pbs_mom;n/a;run_pelog;userepilog script '/var/spool/torque/mom_priv/epilogue.precancel' does not exist (cwd: /var/spool/torque/mom_priv)
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;kill_job done
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;sending preobit jobstat
09/23/2008 16:53:49;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
09/23/2008 16:53:49;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
09/23/2008 16:53:49;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;performing job clean-up
09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_close_poll;entered
09/23/2008 16:53:49;0080;   pbs_mom;Svr;mom_get_sample;proc_array load started
09/23/2008 16:53:49;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=72
09/23/2008 16:53:49;0080;   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 379.argo.cc.uic.edu
09/23/2008 16:53:49;0080;   pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 379.argo.cc.uic.edu
09/23/2008 16:53:49;0080;   pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 379.argo.cc.uic.edu
09/23/2008 16:53:49;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
09/23/2008 16:53:49;0080;   pbs_mom;Req;post_epilogue;preparing obit message
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;obit sent to server
09/23/2008 16:53:49;0080;   pbs_mom;Req;dis_request_read;decoding command CopyFiles from PBS_Server
09/23/2008 16:53:49;0008;   pbs_mom;Job;process_request;request type CopyFiles from host argo.cc.uic.edu received
09/23/2008 16:53:49;0008;   pbs_mom;Job;process_request;request type CopyFiles from host argo.cc.uic.edu allowed

This is the key section:

09/23/2008 16:53:49;0008;   pbs_mom;Job;dispatch_request;dispatching request CopyFiles on sd=10
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;attempting to copy file 'argo.cc.uic.edu:/home/homes51/mhoma/hello_world.o379'
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;forking to user, uid: 52180  gid: 100  homedir: '/home/homes51/mhoma'

It says "attempting to copy file". Though it doesn't give a message saying "copying successful",
I do get the the stderr/sdtout files:

drwx------  38 mhoma sys       16384 Sep 23 16:54 .
-rw-------   1 mhoma users         0 Sep 23 16:53 hello_world.e379
-rw-------   1 mhoma users        12 Sep 23 16:53 hello_world.o379

cat hello_world.o379
Hello-world

The only thing that looks out of the ordinary is the GID on the two files.

The rest of the messages are the cleanup.

09/23/2008 16:53:49;0002;   pbs_mom;n/a;mom_close_poll;entered
09/23/2008 16:53:49;0080;   pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server
09/23/2008 16:53:49;0008;   pbs_mom;Job;process_request;request type DeleteJob from host argo.cc.uic.edu received
09/23/2008 16:53:49;0008;   pbs_mom;Job;process_request;request type DeleteJob from host argo.cc.uic.edu allowed
09/23/2008 16:53:49;0008;   pbs_mom;Job;dispatch_request;dispatching request DeleteJob on sd=10
09/23/2008 16:53:49;0008;   pbs_mom;Job;379.argo.cc.uic.edu;deleting job
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;deleting job 379.argo.cc.uic.edu in state EXITED
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;removing job
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;removed job script
09/23/2008 16:53:49;0080;   pbs_mom;Job;379.argo.cc.uic.edu;removed job file
09/23/2008 16:53:49;0080;   pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
09/23/2008 16:53:49;0080;   pbs_mom;Svr;mom_get_sample;proc_array load started
09/23/2008 16:53:49;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=72
09/23/2008 16:53:49;0008;   pbs_mom;Job;scan_for_terminated;pid 12574 not tracked, exitcode=0

Michael


More information about the torqueusers mailing list