[torqueusers] ssh problem

Wightman wightman at clusterresources.com
Mon Feb 7 15:26:19 MST 2005


Make sure you can execute the 'scp' commands using the fully qualified
domain name (FQDN).  The short alias is sometimes not enough.
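
As a rough sketch of that check, the script below prints the same copy
once with the short alias and once with a fully qualified name. The
domain "example.com" is a placeholder (the real domain is not given in
this thread), and scp_cmd is just a helper name made up for illustration.
BatchMode=yes is a standard OpenSSH option that makes scp fail instead of
prompting, which should be close to what a daemon without a terminal sees.

```shell
#!/bin/sh
# Sketch only: prints the scp commands to try, it does not run them.

# Build a non-interactive scp command line for one destination host.
# $1 = source file, $2 = remote user, $3 = remote host, $4 = remote directory
scp_cmd() {
    # BatchMode=yes: fail rather than ask for a password or host key confirmation.
    printf 'scp -o BatchMode=yes %s %s@%s:%s\n' "$1" "$2" "$3" "$4"
}

# hal.example.com is a placeholder; substitute the server's real FQDN.
for host in hal hal.example.com; do
    scp_cmd /var/local/torque/spool/y.hal.OU alleon "$host" /home/alleon
done
```

Run each printed command on the compute node as the job owner. If the
alias form works but the FQDN form reports "Host key verification failed",
the FQDN is probably missing from that user's known_hosts file; connecting
once interactively by that name lets you accept the key.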

- Douglas
Cluster Resources, INC.


On Mon, 2005-02-07 at 23:13 +0100, Guillaume Alleon wrote:
> As the job owner I can do it ... when logged on bladexx, I can do
>       scp /var/local/torque/spool/y.hal.OU alleon@hal:/home/alleon
> without any problem.
> I really don't understand the problem! Usually when something is wrong,
> the files end up in the undelivered directory. In my case, they are
> still in the spool directory.
> 
> 
> 
> Kevin Van Workum wrote:
> 
> >Make sure that you can scp from blade* to your pbs_server machine without 
> >using a password (SSH key authentication).
> >
> >On Mon, 7 Feb 2005, Guillaume Alleon wrote:
> >
> >  
> >
> >>Hi,
> >>
> >>I am running torque-1.2.0p0 on an x86_64 machine. It is configured with 
> >>the --with-scp option.
> >>Running a job works until the stage-out. Then I get the following message:
> >>
> >>Host key verification failed.
> >>lost connection
> >>
> >>The notification email is the following:
> >>
> >>-------------------------------------------------------------------------------------------------------
> >>Message 1:
> >> From adm at hal  Mon Feb  7 19:57:23 2005
> >>Date: Mon, 7 Feb 2005 19:57:23 +0100
> >>From: adm <adm at hal>
> >>To: alleon at hal
> >>Subject: PBS JOB 39.hal
> >>Precedence: bulk
> >>
> >>PBS Job Id: 39.hal
> >>Job Name:   zaza
> >>File stage in failed, see below.
> >>Job will be retried later, please investigate and correct problem.
> >>Post job file processing error; job 39.hal on host 
> >>blade08/1+blade08/0+blade07/1+blade07/0+blade06/1+blade06/0+blade05/1+blade05/0
> >>
> >>Unable to copy file 39.hal.OU to hal:/home/alleon/test/zaza.o39
> >>-------------------------------------------------------------------------------------------------------
> >>
> >>When I do the copy as the job owner, it works. The MOM logs do not
> >>say much:
> >>
> >>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server@hal, sock=10
> >>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server@hal, sock=10
> >>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server@hal, sock=10
> >>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server@hal, sock=10
> >>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server@hal, sock=13
> >>02/07/2005 19:52:40;0001;   pbs_mom;Job;TMomFinalizeJob3;job 39.hal started, pid = 10881
> >>02/07/2005 19:52:40;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
> >>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server@hal, sock=10
> >>02/07/2005 19:52:41;0008;   pbs_mom;Job;39.hal;start_process: task started, tid 2, sid 10934, cmd /bin/sh
> >>02/07/2005 19:52:41;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
> >>02/07/2005 19:52:41;0008;   pbs_mom;Job;39.hal;start_process: task started, tid 3, sid 10936, cmd /bin/sh
> >>02/07/2005 19:53:12;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server@hal, sock=11
> >>02/07/2005 19:54:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server@hal, sock=11
> >>02/07/2005 19:55:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server@hal, sock=11
> >>02/07/2005 19:56:35;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 2 terminated, sid 10934
> >>02/07/2005 19:56:35;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 3 terminated, sid 10936
> >>02/07/2005 19:56:35;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
> >>02/07/2005 19:56:36;0008;   pbs_mom;Job;39.hal;kill_task: killing pid 10882 task 1 with sig 9
> >>02/07/2005 19:56:36;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 1 terminated, sid 10881
> >>02/07/2005 19:56:36;0008;   pbs_mom;Job;39.hal;Terminated
> >>02/07/2005 19:56:36;0100;   pbs_mom;Req;;Type CopyFiles request received from PBS_Server@hal, sock=10
> >>02/07/2005 19:56:58;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server@hal, sock=10
> >>
> >>Do you have any idea what I am doing wrong?
> >>Yours,
> >>
> >>Guillaume
> >>
> >>_______________________________________________
> >>torqueusers mailing list
> >>torqueusers at supercluster.org
> >>http://supercluster.org/mailman/listinfo/torqueusers


