[torqueusers] scp error

Garrick Staples garrick at clusterresources.com
Thu Nov 30 12:54:03 MST 2006


On Thu, Nov 30, 2006 at 03:15:44PM +0100, LEROY Christine alleged:
> Hello,
> 
> We are using torque and maui beside our grid middleware, and users are
> complaining that their jobs are sometimes failing with no output.
> 
> We had a look in our logs and we can see those errors:
> 
> Nov 30 02:18:31 wn021 pbs_mom: sys_copy, command '/usr/bin/scp -rpB
> /var/spool/pbs/spool/87831.node0.OU
> atlp@node07.datagrid.cea.fr:/home/atlp/.lcgjm/globus-cache-export.Y30406/batch.out'
> failed with status=1, giving up after 4 attempts
> 
> Nov 30 02:18:36 wn021 pbs_mom: sys_copy, command '/usr/bin/scp -rpB
> /var/spool/pbs/spool/87831.node0.ER
> atlp@node07.datagrid.cea.fr:/home/atlp/.lcgjm/globus-cache-export.Y30406/batch.err'
> failed with status=1, giving up after 4 attempts
> 
> (node07.datagrid.cea.fr is our pbs server, and wn021 is one of our nodes
> where pbs_mom is running)
> 
> Are those files "/var/spool/pbs/spool/87831.node0.OU" and
> "/var/spool/pbs/spool/87831.node0.ER" deleted too soon by the system on
> the pbs_mom node?
> 
> Or is it possible to configure the number of attempts?
> 
> Thanks in advance for your help.
> 
> Cheers
> 
> Christine
> 
> PS: We also have the same type of error, but at the beginning of the
> job:
> 
> Nov 30 04:40:21 wn021 pbs_mom: sys_copy, command '/usr/bin/scp -rpB
> fus176@node07.datagrid.cea.fr:/home/fus176/.lcgjm/globus-cache-export.g22960/globus-cache-export.g22960.gpg
> globus-cache-export.g22960.gpg' failed with status=1, giving up after
> 4 attempts
> 

The number of tries isn't configurable, and IMHO it doesn't need to be:
generally speaking, any failure will simply repeat on every attempt
until it gives up, so 4 tries is as good as 1 try, which is as good as
50 tries.

Make sure you are on 2.1.6; 2.1.4 and 2.1.5 have some things broken in
this area.
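
If you aren't sure what the nodes are actually running, something like
this should show it (assuming an RPM-based install; adjust for your
packaging):

    rpm -q torque
    # the momctl diagnostics also report the MOM's version:
    momctl -d 1 -h wn021 | grep -i version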

Since this is likely an ssh configuration error, the exact error message
should have been sent to the user in an email.
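
If that mail never shows up, you can usually reproduce the failure by
hand. A minimal check, run on the node as the affected user (file names
below are just examples; this assumes passwordless ssh is supposed to
work from the nodes to the server):

    su - atlp
    echo test > /tmp/scp-test
    /usr/bin/scp -rpB /tmp/scp-test node07.datagrid.cea.fr:/tmp/
    echo $?

-B puts scp in batch mode, so anything that would prompt for a password
fails outright, exactly as it does for pbs_mom.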

If /home is shared on your cluster, add suitable $usecp lines to your
MOM config so that scp isn't used anymore.
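
Something along these lines in each node's mom_priv/config would do it
(the wildcard and paths are just an example for your site, assuming
/home is mounted at the same path everywhere):

    # /var/spool/pbs/mom_priv/config
    $usecp *.datagrid.cea.fr:/home /home

Restart (or HUP) pbs_mom afterwards so it re-reads the config; with a
matching $usecp entry the output files are copied locally with cp
instead of going through scp.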


