[torqueusers] PBS log file copies
Tom Rosmond
rosmond at reachone.com
Fri Sep 7 09:40:12 MDT 2012
Hello,
I have installed and configured TORQUE on a small (2 sockets, 8
cores/socket) Debian Linux server, with NUMA and CPUSETS enabled.
Everything is working very well, with one exception: the PBS log files
requested with the '-o' command-line option are not being copied to the
desired destination. The copies fail, so the files stay in the
'undelivered' directory. Here is an output fragment from 'daemon.log':
------------------------------- snip ----------------------------------
Sep 6 17:19:06 fir pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp
-rpB /var/spool/torque/spool/56.localhost.OU
rosmond at fir:/scr/rosmond/testda//arscrpt/update_ar_semi_1_2010081818'
failed with status=1, giving up after 4 attempts
Sep 6 17:19:06 fir pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
file /var/spool/torque/spool/56.localhost.OU to
rosmond at fir:/scr/rosmond/testda//arscrpt/update_ar_semi_1_2010081818
Sep 6 17:19:06 fir pbs_mom: LOG_ERROR::req_cpyfile, #012#012Unable to
copy file /var/spool/torque/spool/56.localhost.OU to
rosmond at fir:/scr/rosmond/testda//arscrpt/update_ar_semi_1_2010081818#012*** error from copy#012Host key verification failed.#015#012lost connection#012*** end error output#012Output retained on that host in: /var/spool/torque/undelivered/56.localhost.OU
-------------------------------- snip --------------------------------
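The '#012' and '#015' sequences in the last entry are the syslog daemon's
octal escapes for newline and carriage return. Below is a minimal Python
sketch for making that entry readable, assuming the usual '#' plus three
octal digits convention; the function name decode_syslog_escapes is
hypothetical and not part of TORQUE or syslog.

```python
import re


def decode_syslog_escapes(text: str) -> str:
    """Replace '#ooo' octal escapes (e.g. '#012') with the characters they encode.

    Assumption: every '#' followed by three octal digits is an escape. A
    literal '#' followed by three such digits in the original message would
    be decoded as well, so treat the result as a reading aid only.
    """
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), text)


# Fragment copied from the last daemon.log entry above.
fragment = (
    "#012*** error from copy#012Host key verification failed.#015#012"
    "lost connection#012*** end error output"
)

decoded = decode_syslog_escapes(fragment)
for line in decoded.splitlines():
    print(line)
```

Decoded this way, the entry shows the message reported by the copy command
itself: "Host key verification failed." followed by "lost connection".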
I interpret the error as 'scp' trying to copy from one physical node to
another ( rosmond at fir ). Since these are NUMA nodes, that can't work,
whereas a simple 'cp' to the destination location and file name would.
Is this correct? If so, what do I need to do to configure the system to
get successful copies?
BTW, we have 2 other nearly identical NUMA systems that don't have this
problem. I have tried to mimic their configuration as closely as
possible, so something must still be different, but I can't find it.
T. Rosmond