[torqueusers] ssh problem

Guillaume Alleon guillaume.alleon at laposte.net
Mon Feb 7 15:51:46 MST 2005


I am not sure I follow your remark. All my nodes and the frontend have only
short names stored in /etc/hosts.
Could that be the problem?
Guillaume
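
To see which machines are known only by a short name, something like the
following could be run on each blade and on the frontend. It is only a
sketch: the function name hosts_without_fqdn is made up for illustration,
and it treats any name containing a dot as fully qualified. It says nothing
about whether short names are actually the cause of the scp failure.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE or OpenSSH): print the first name
# of every entry in an /etc/hosts-style file that has no dotted name at all.
# The file defaults to /etc/hosts when no argument is given.
hosts_without_fqdn() {
    awk '
        { sub(/#.*/, "") }      # drop trailing comments
        NF < 2 { next }         # skip blank and address-only lines
        {
            has_fqdn = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /\./) has_fqdn = 1
            if (!has_fqdn) print $2
        }
    ' "${1:-/etc/hosts}"
}
```

Any host it prints can then be tried by hand with scp, once under the short
name and once under whatever fully-qualified name the site uses, to see
whether the two behave differently.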

Wightman wrote:

>Make sure you can execute the 'scp' commands using the Fully-Qualified
>domain name.  The alias is sometimes not enough.
>
>- Douglas
>Cluster Resources, INC.
>
>
>On Mon, 2005-02-07 at 23:13 +0100, Guillaume Alleon wrote:
>
>>As the job owner I can do it ... when logged on bladexx, I can do
>>      scp /var/local/torque/spool/y.hal.OU alleon at hal:/home/alleon
>>without any problem.
>>I really don't understand the problem! Usually when something is wrong,
>>the files end up in the undelivered directory. In my case, they are
>>still in the spool directory.
>>
>>Kevin Van Workum wrote:
>>
>>>Make sure that you can scp from blade* to your pbs_server machine without 
>>>using a password (SSH key authentication).
>>>
>>>On Mon, 7 Feb 2005, Guillaume Alleon wrote:
>>>
>>>>Hi,
>>>>
>>>>I am running torque-1.2.0p0 on an x86_64 machine. It is configured with 
>>>>the --with-scp option.
>>>>A job runs fine until stage-out. Then I get the following message:
>>>>
>>>>Host key verification failed.
>>>>lost connection
>>>>
>>>>The notification email is the following:
>>>>
>>>>-------------------------------------------------------------------------------------------------------
>>>>Message 1:
>>>>From adm at hal  Mon Feb  7 19:57:23 2005
>>>>Date: Mon, 7 Feb 2005 19:57:23 +0100
>>>>From: adm <adm at hal>
>>>>To: alleon at hal
>>>>Subject: PBS JOB 39.hal
>>>>Precedence: bulk
>>>>
>>>>PBS Job Id: 39.hal
>>>>Job Name:   zaza
>>>>File stage in failed, see below.
>>>>Job will be retried later, please investigate and correct problem.
>>>>Post job file processing error; job 39.hal on host 
>>>>blade08/1+blade08/0+blade07/1+blade07/0+blade06/1+blade06/0+blade05/1+blade05/0
>>>>
>>>>Unable to copy file 39.hal.OU to hal:/home/alleon/test/zaza.o39
>>>>-------------------------------------------------------------------------------------------------------
>>>>
>>>>When I do the copy as the job owner, it works. The mom logs do not
>>>>tell much:
>>>>
>>>>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at hal, sock=10
>>>>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at hal, sock=10
>>>>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at hal, sock=10
>>>>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at hal, sock=10
>>>>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=13
>>>>02/07/2005 19:52:40;0001;   pbs_mom;Job;TMomFinalizeJob3;job 39.hal started, pid = 10881
>>>>02/07/2005 19:52:40;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
>>>>02/07/2005 19:52:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=10
>>>>02/07/2005 19:52:41;0008;   pbs_mom;Job;39.hal;start_process: task started, tid 2, sid 10934, cmd /bin/sh
>>>>02/07/2005 19:52:41;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
>>>>02/07/2005 19:52:41;0008;   pbs_mom;Job;39.hal;start_process: task started, tid 3, sid 10936, cmd /bin/sh
>>>>02/07/2005 19:53:12;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=11
>>>>02/07/2005 19:54:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=11
>>>>02/07/2005 19:55:40;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at hal, sock=11
>>>>02/07/2005 19:56:35;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 2 terminated, sid 10934
>>>>02/07/2005 19:56:35;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 3 terminated, sid 10936
>>>>02/07/2005 19:56:35;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
>>>>02/07/2005 19:56:36;0008;   pbs_mom;Job;39.hal;kill_task: killing pid 10882 task 1 with sig 9
>>>>02/07/2005 19:56:36;0080;   pbs_mom;Job;39.hal;scan_for_terminated: job 39.hal task 1 terminated, sid 10881
>>>>02/07/2005 19:56:36;0008;   pbs_mom;Job;39.hal;Terminated
>>>>02/07/2005 19:56:36;0100;   pbs_mom;Req;;Type CopyFiles request received from PBS_Server at hal, sock=10
>>>>02/07/2005 19:56:58;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at hal, sock=10
>>>>
>>>>Do you have any idea what I am doing wrong?
>>>>Yours
>>>>
>>>>Guillaume
>>>>
>>>>_______________________________________________
>>>>torqueusers mailing list
>>>>torqueusers at supercluster.org
>>>>http://supercluster.org/mailman/listinfo/torqueusers
