[torqueusers] intermom & copyback weirdness

Daniel Widyono widyono at seas.upenn.edu
Thu Jul 28 15:50:06 MDT 2005


I have another question, regarding inter-mom communication.  Jobs run
successfully: I use SSH to log in to the nodes and between the nodes, and can
do just about everything on multiple nodes.  However, when a job completes I
get errors that appear to be related to copyback.  I have confirmed that I can
scp any file from a node to the head node.  The significant logs are included
inline below.



I am primarily concerned with:

1) KILL_JOB 305.alpha.genomics.internal node ???
  First stanza below, fourth line.  It seems some information is getting
  truncated / not delivered (the "???").  This appears to be corroborated
  by the following...

2) im_request, job 305.alpha.genomics.internal: command 0
   im_eof, Premature end of message from addr 192.168.0.119:15003
  Last two lines of third stanza below.  Why premature?  I can't
  get either mom to divulge any more information.

3) cannot contact sisters - node 0 failed
  This is in the last stanza at the bottom.


Indeed, in the e-mailed report I get:

================================================================
PBS Job Id: 305.alpha.genomics.internal
Job Name:   aa
File stage in failed, see below.
Job will be retried later, please investigate and correct problem.
Post job file processing error; job 305.alpha.genomics.internal on host
+node120/1+node120/0+node119/1+node119/0+node118/1+node118/0+node117/1+node117/0+node116/1+node116/0

Unable to copy file 305.alpha.g.OU to
alpha.genomics.internal:/home/widyono/stdout
>>> error from copy
alpha.genomics.internal: Connection refused
,password).
lost connection
>>> end error output

Unable to copy file 305.alpha.g.ER to
alpha.genomics.internal:/home/widyono/stderr
>>> error from copy
alpha.genomics.internal: Connection refused
,password).
lost connection
>>> end error output
================================================================
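
For reading the host list in the report above, a minimal sketch that groups
the '+node/cpu' entries by node.  The function name parse_exec_host is
hypothetical, and the format is assumed only from the one string shown in
this report.

```python
from collections import OrderedDict


def parse_exec_host(spec):
    """Group a '+node/cpu+node/cpu' string into {node: [cpu, ...]}."""
    hosts = OrderedDict()
    for entry in spec.strip().split("+"):
        if not entry:
            # The string in the report starts with '+', so the first
            # field after splitting is empty.
            continue
        node, _, cpu = entry.partition("/")
        hosts.setdefault(node, []).append(int(cpu))
    return hosts


spec = ("+node120/1+node120/0+node119/1+node119/0+node118/1+node118/0"
        "+node117/1+node117/0+node116/1+node116/0")

for node, cpus in parse_exec_host(spec).items():
    print(node, cpus)
```

The first node listed (node120) is the one the logs below identify as mother
superior for this job.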

I am at a loss as to which is cause and which is symptom, the copy error or
the internode communication error (or whether both are symptoms of some other
cause as yet undetermined).
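
One way to look at the copy error in isolation might be a plain TCP probe run
from the mother superior toward the head node.  This is only a sketch: the
function name probe is hypothetical, and which port the copy command actually
contacts is an assumption (22 if pbs_mom is copying with scp).

```python
import socket


def probe(host, port, timeout=3.0):
    """Return 'open', 'refused', or 'error: ...' for a TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port.
        return "refused"
    except OSError as exc:
        # Timeouts, name-resolution failures, unreachable hosts.
        return "error: %s" % exc
```

For example, probe("alpha.genomics.internal", 22) run as the job owner on
node120 would show whether the "Connection refused" in the report can be
reproduced outside of pbs_mom.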

Regards,
Dan W.







Here are the logs from pbs_mom with logging turned way up.


node 119 (a sister):

07/26/2005 15:20:22;0008;   pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
07/26/2005 15:20:22;0008;   pbs_mom;Job;im_request;connect from 192.168.0.120:1023
07/26/2005 15:20:22;0008;   pbs_mom;Job;305.alpha.genomics.internal;received request 'ALL_OKAY' from 192.168.0.120:1023
07/26/2005 15:20:22;0008;   pbs_mom;Job;305.alpha.genomics.internal;im_request: KILL_JOB 305.alpha.genomics.internal node ???
07/26/2005 15:20:22;0008;   pbs_mom;Job;305.alpha.genomics.internal;kill_job
07/26/2005 15:20:22;0002;   pbs_mom;n/a;run_pelog;userepilog script '/var/spool/PBS/mom_priv/epilogue.precancel' does not exist (cwd: /var/spool/PBS/mom_priv)
07/26/2005 15:20:22;0008;   pbs_mom;Job;305.alpha.genomics.internal;kill_job done
07/26/2005 15:20:22;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 192.168.0.120:1023




node 120 (mother superior for this job):

07/26/2005 15:20:23;0002;   pbs_mom;n/a;mom_set_limits;mom_set_limits(305.alpha.genomics.internal,alter) entered
07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;Job Modified at request of PBS_Server at alpha.genomics.internal
07/26/2005 15:20:23;0008;   pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
07/26/2005 15:20:23;0008;   pbs_mom;Job;im_request;connect from 192.168.0.119:15003
07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;received request 'ALL_OKAY' from 192.168.0.119:15003
07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;joinjob response received from node 192.168.0.119:15003, (still waiting for node118)
07/26/2005 15:20:23;0008;   pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp

[...]

07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;all sisters have reported in, launching job locally
07/26/2005 15:20:23;0002;   pbs_mom;n/a;mom_close_poll;entered
07/26/2005 15:20:23;0001;   pbs_mom;Job;305.alpha.genomics.internal;phase 2 of job launch successfully completed
07/26/2005 15:20:23;0001;   pbs_mom;Job;TMomFinalizeJob3;read start return code=0 session=13073
07/26/2005 15:20:23;0001;   pbs_mom;Job;305.alpha.genomics.internal;saving task (TMomFinalizeJob3)
07/26/2005 15:20:23;0008;   pbs_mom;Job;task_save;saving task in /var/spool/PBS/mom_priv/jobs/305.alpha.g.TK/0000000001
07/26/2005 15:20:23;0001;   pbs_mom;Job;TMomFinalizeJob3;job 305.alpha.genomics.internal started, pid = 13073
07/26/2005 15:20:23;0001;   pbs_mom;Job;305.alpha.genomics.internal;job successfully started
07/26/2005 15:20:23;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=13073 pid=13073 cputime=0 (cputfactor=1.000000)
07/26/2005 15:20:23;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=13073 pid=13076 cputime=0 (cputfactor=1.000000)
07/26/2005 15:20:23;0008;   pbs_mom;Job;scan_for_terminated;for job 305.alpha.genomics.internal, task 1, pid=13073, exitcode=0
07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;sending signal 9 to task
07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;kill_task: killing pid 13076 task 1 with sig 9
07/26/2005 15:20:23;0008;   pbs_mom;Job;task_save;saving task in /var/spool/PBS/mom_priv/jobs/305.alpha.g.TK/0000000001
07/26/2005 15:20:23;0080;   pbs_mom;Job;305.alpha.genomics.internal;saving task in /var/spool/PBS/mom_priv/jobs/305.alpha.g.TK/0000000001
07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;Terminated
07/26/2005 15:20:23;0008;   pbs_mom;Req;send_sisters;sending command KILL_JOB (2)
07/26/2005 15:20:23;0008;   pbs_mom;Job;task_save;saving task in /var/spool/PBS/mom_priv/jobs/305.alpha.g.TK/0000000001
07/26/2005 15:20:23;0008;   pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
07/26/2005 15:20:23;0008;   pbs_mom;Job;im_request;connect from 192.168.0.119:15003
07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;received request 'ALL_OKAY' from 192.168.0.119:15003
07/26/2005 15:20:23;0001;   pbs_mom;Job;305.alpha.genomics.internal;KILL_JOB acknowledgement received
07/26/2005 15:20:23;0001;   pbs_mom;Svr;pbs_mom;im_request, job 305.alpha.genomics.internal: command 0
07/26/2005 15:20:23;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.119:15003

[...]

07/26/2005 15:20:23;0008;   pbs_mom;Job;305.alpha.genomics.internal;received request 'ALL_OKAY' from 192.168.0.116:15003
07/26/2005 15:20:23;0001;   pbs_mom;Job;305.alpha.genomics.internal;KILL_JOB acknowledgement received
07/26/2005 15:20:23;0001;   pbs_mom;Svr;pbs_mom;im_request, job 305.alpha.genomics.internal: command 0
07/26/2005 15:20:23;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.116:15003
07/26/2005 15:20:33;0002;   pbs_mom;n/a;is_update_stat;composing status update for server
07/26/2005 15:20:33;0002;   pbs_mom;n/a;totmem;totmem: total mem=4271566848 
07/26/2005 15:20:33;0002;   pbs_mom;n/a;availmem;availmem: free mem=4221247488
07/26/2005 15:20:33;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to server
07/26/2005 15:20:43;0008;   pbs_mom;Job;mom_main;no active process found
07/26/2005 15:20:43;0001;   pbs_mom;Job;305.alpha.genomics.internal;saving task (main loop)
07/26/2005 15:20:43;0008;   pbs_mom;Job;task_save;saving task in /var/spool/PBS/mom_priv/jobs/305.alpha.g.TK/0000000001
07/26/2005 15:20:43;0008;   pbs_mom;Req;send_sisters;sending command POLL_JOB (7)
07/26/2005 15:20:43;0008;   pbs_mom;Job;305.alpha.genomics.internal;cannot contact sisters - node 0 failed
07/26/2005 15:20:43;0008;   pbs_mom;Job;305.alpha.genomics.internal;Terminated
07/26/2005 15:20:43;0008;   pbs_mom;Req;send_sisters;sending command KILL_JOB (2)
07/26/2005 15:20:43;0008;   pbs_mom;Job;task_save;saving task in /var/spool/PBS/mom_priv/jobs/305.alpha.g.TK/0000000001
07/26/2005 15:20:43;0080;   pbs_mom;Job;305.alpha.genomics.internal;local task termination detected.  killing job
07/26/2005 15:20:43;0008;   pbs_mom;Job;305.alpha.genomics.internal;kill_job
07/26/2005 15:20:43;0002;   pbs_mom;n/a;run_pelog;userepilog script '/var/spool/PBS/mom_priv/epilogue.precancel' does not exist (cwd: /var/spool/PBS/mom_priv)
07/26/2005 15:20:43;0008;   pbs_mom;Job;305.alpha.genomics.internal;kill_job done
07/26/2005 15:20:43;0080;   pbs_mom;Job;305.alpha.genomics.internal;performing job clean-up
07/26/2005 15:20:43;0002;   pbs_mom;n/a;mom_close_poll;entered
07/26/2005 15:20:43;0008;   pbs_mom;Job;scan_for_terminated;pid 13126 not tracked, exitcode=0
07/26/2005 15:20:43;0008;   pbs_mom;Job;process_request;request type CopyFiles from host alpha.genomics.internal received
07/26/2005 15:20:43;0008;   pbs_mom;Job;process_request;request type CopyFiles from host alpha.genomics.internal allowed
07/26/2005 15:20:43;0008;   pbs_mom;Job;dispatch_request;dispatching request CopyFiles on sd=11
07/26/2005 15:20:43;0004;   pbs_mom;Fil;305.alpha.genomics.internal;forking to user, uid: 17406  gid: 2940  homedir: '/home/widyono'
07/26/2005 15:20:43;0002;   pbs_mom;n/a;mom_close_poll;entered
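
To pull the lines of interest out of logs like the ones above, a minimal
sketch that splits each pbs_mom line on its semicolons and keeps the records
whose message contains a given string.  The field names (when, code, daemon,
kind, obj, message) are my own labels for the six columns seen in these logs,
not official names, and the event code is treated as an opaque string.

```python
from collections import namedtuple
from datetime import datetime

LogRecord = namedtuple("LogRecord", "when code daemon kind obj message")


def parse_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # blank lines, '[...]' markers, wrapped text
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogRecord(when, parts[1], parts[2].strip(),
                     parts[3], parts[4], parts[5])


def grep_messages(lines, needles):
    """Yield parsed records whose message contains any of the needles."""
    for line in lines:
        rec = parse_line(line)
        if rec and any(n in rec.message for n in needles):
            yield rec


sample = [
    "07/26/2005 15:20:23;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.119:15003",
    "07/26/2005 15:20:33;0002;   pbs_mom;n/a;is_update_stat;composing status update for server",
    "[...]",
    "07/26/2005 15:20:43;0008;   pbs_mom;Job;305.alpha.genomics.internal;cannot contact sisters - node 0 failed",
]

for rec in grep_messages(sample, ["Premature end", "cannot contact sisters"]):
    print(rec.when.time(), rec.code, rec.obj, rec.message)
```

Running the same filter over the logs of every node in the job and sorting by
timestamp may make it easier to see which message comes first.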

