[torqueusers] ssh problem

Guillaume Alleon guillaume.alleon at laposte.net
Mon Feb 7 13:42:46 MST 2005


Hi,

I am running torque-1.2.0p0 on an x86_64 machine. It is configured with
the --with-scp option.
Jobs run fine until stage-out, at which point I get the following message:

Host key verification failed.
lost connection

The notification email is the following:

-------------------------------------------------------------------------------------------------------
Message 1:
 From adm at hal  Mon Feb  7 19:57:23 2005
Date: Mon, 7 Feb 2005 19:57:23 +0100
From: adm <adm at hal>
To: alleon at hal
Subject: PBS JOB 39.hal
Precedence: bulk

PBS Job Id: 39.hal
Job Name:   zaza
File stage in failed, see below.
Job will be retried later, please investigate and correct problem.
Post job file processing error; job 39.hal on host 
blade08/1+blade08/0+blade07/1+blade07/0+blade06/1+blade06/0+blade05/1+blade05/0

Unable to copy file 39.hal.OU to hal:/home/alleon/test/zaza.o39
-------------------------------------------------------------------------------------------------------

When I do the copy by hand as the job owner, it works. The mom logs do
not tell me much:

02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at hal, sock=10
02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at hal, sock=10
02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at hal, sock=10
02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at hal, sock=10
02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=13
02/07/2005 19:52:40;0001;   pbs_mom;Job;TMomFinalizeJob3;job 39.hal started, pid = 10881
02/07/2005 19:52:40;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=10
02/07/2005 19:52:41;0008;   pbs_mom;Job;39.hal;start_process: task started, tid 2, sid 10934, cmd /bin/sh
02/07/2005 19:52:41;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
02/07/2005 19:52:41;0008;   pbs_mom;Job;39.hal;start_process: task started, tid 3, sid 10936, cmd /bin/sh
02/07/2005 19:53:12;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=11
02/07/2005 19:54:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=11
02/07/2005 19:55:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=11
02/07/2005 19:56:35;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 2 terminated, sid 10934
02/07/2005 19:56:35;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 3 terminated, sid 10936
02/07/2005 19:56:35;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
02/07/2005 19:56:36;0008;   pbs_mom;Job;39.hal;kill_task: killing pid 10882 task 1 with sig 9
02/07/2005 19:56:36;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 1 terminated, sid 10881
02/07/2005 19:56:36;0008;   pbs_mom;Job;39.hal;Terminated
02/07/2005 19:56:36;0100;   pbs_mom;Req;;Type CopyFiles request received from PBS_Server at hal, sock=10
02/07/2005 19:56:58;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at hal, sock=10
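
For reference, a manual copy check of the kind mentioned above can be
sketched as follows. This is only a rough sketch under assumptions: the
function name check_stageout and the SCP override variable are made up
for illustration, and the source file location on the compute node is
not known beyond the file name in the notification email. The two
OpenSSH options make scp fail rather than prompt, so the test behaves
non-interactively, which is presumably closer to how a daemon would run
the copy than an interactive login is.

```shell
# Hypothetical helper: try the stage-out copy by hand, non-interactively.
# Run it on a compute node (e.g. blade08) as the job owner.
check_stageout() {
    src=$1    # file to copy, e.g. 39.hal.OU
    dest=$2   # scp target, e.g. hal:/home/alleon/test/zaza.o39

    # BatchMode=yes: never prompt for passwords or host key confirmation.
    # StrictHostKeyChecking=yes: refuse hosts missing from known_hosts.
    # SCP is an assumed override hook for dry runs; it defaults to scp.
    "${SCP:-scp}" -o BatchMode=yes -o StrictHostKeyChecking=yes "$src" "$dest"
}

# Example call (not executed here):
#   check_stageout 39.hal.OU hal:/home/alleon/test/zaza.o39
```

If this fails with the same "Host key verification failed." message
while a plain interactive scp succeeds, the difference would point at
the host key setup rather than at the copy itself.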

Do you have any idea what I am doing wrong?
Yours

Guillaume





