[torqueusers] PBS log file copies

Tom Rosmond rosmond at reachone.com
Fri Sep 7 09:40:12 MDT 2012


I have installed and configured TORQUE on a small (2 sockets, 8
cores/socket) Debian Linux server, with NUMA and CPUSETS enabled.
Everything is working very well, with one exception: the PBS log files
requested with the '-o' command-line option are not being copied to the
desired destination.  The copies fail, so the files stay in the
'undelivered' directory.  Here is an output fragment from 'daemon.log':

------------------------------- snip ----------------------------------

Sep  6 17:19:06 fir pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp
-rpB /var/spool/torque/spool/56.localhost.OU
rosmond at fir:/scr/rosmond/testda//arscrpt/update_ar_semi_1_2010081818'
failed with status=1, giving up after 4 attempts
Sep  6 17:19:06 fir pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
file /var/spool/torque/spool/56.localhost.OU to
rosmond at fir:/scr/rosmond/testda//arscrpt/update_ar_semi_1_2010081818
Sep  6 17:19:06 fir pbs_mom: LOG_ERROR::req_cpyfile, #012#012Unable to
copy file /var/spool/torque/spool/56.localhost.OU to
rosmond at fir:/scr/rosmond/testda//arscrpt/update_ar_semi_1_2010081818#012*** error from copy#012Host key verification failed.#015#012lost connection#012*** end error output#012Output retained on that host in: /var/spool/torque/undelivered/56.localhost.OU

-------------------------------- snip --------------------------------
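
The '#012' and '#015' sequences in the last log entry appear to be
syslog's octal escapes for newline and carriage return.  A minimal
Python sketch that decodes them so the scp error is easier to read (the
function name is only for illustration, and it assumes every '#'
followed by three octal digits is an escape, which may not hold for
arbitrary messages):

```python
import re


def decode_syslog_escapes(text: str) -> str:
    """Replace '#ooo' octal escapes (e.g. '#012') with the characters they encode."""
    # Assumption: any '#' followed by exactly three octal digits is an escape.
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), text)


# Fragment copied from the last daemon.log entry above.
raw = (
    "#012*** error from copy#012Host key verification failed."
    "#015#012lost connection#012*** end error output#012"
)

decoded = decode_syslog_escapes(raw)

# Drop carriage returns so the text prints cleanly on a terminal.
print(decoded.replace("\r", ""))
```

Decoded this way, the error text from the copy is "Host key
verification failed." followed by "lost connection".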

I interpret the error as 'scp' trying to copy from one physical node to
another ( rosmond at fir ).  Since these are NUMA nodes within a single
server, that can't work, whereas a simple 'cp' to the destination
location and file name would.  Is this correct?  If so, what do I need
to do to configure the system to get successful copies?

BTW, we have 2 other nearly identical NUMA systems that don't have this
problem.  I have tried to mimic their configuration as closely as
possible, so something must still be different, but I can't find it.

T. Rosmond

