[torqueusers] output files not being delivered
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Wed Oct 15 13:37:05 MDT 2008
Hi,
After reading through the threads, I thought I had everything set up
correctly for my output files to be delivered to my home dir, but
apparently not. I am running torque-2.3.2-snap.200807231134 and
maui-3.2.6p19, with xcat1.3.0. Torque was compiled --with-scp.
Permissions for /var/spool/torque/spool on both the head and compute
nodes are: drwxrwxrwt 2 root root 4096 Oct 15 15:12 spool
My home dirs are NFS mounted across all nodes. /opt/xcat/etc/gkh has
all of the ssh keys for each node. The /root/.ssh dir on each node
contains a config file that points to the ssh keys. I am not sure I
need the ssh keys, though, if I am copying to the NFS-mounted dirs
($usecp). The /var/spool/torque/mom_priv/config file has the
following entries (the cluster domain name is spartans; userB is the
user node I am working with).
Node: /var/spool/torque/mom_priv/config file
$logevent 0x1ff
$clienthost nona-man
$tmpdir /scr
$usecp nona-man.spartans:/spartans1 /fs/spartans1
$usecp nona-man.spartans:/spartans2 /fs/spartans2
$usecp nona-man.spartans:/spartans3 /fs/spartans3
$usecp userA.spartans:/userA/u1 /fs/userA1
$usecp userA.spartans:/userA/u2 /fs/userA2
$usecp userB.spartans:/userB/u1 /fs/userB1
$usecp userB.spartans:/userB/u2 /fs/userB2
This is a new cluster, so it is the first time these nodes are
requesting the NFS-mounted dirs. When I submit a job to a node, it
will not cd to the NFS-mounted dir, although when I log into the
node, I can cd to the desired dir without issue.
Oct 15 15:26:48 node1048 pbs_mom: No such file or directory (2) in
TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat' failed: No such
file or directory
Oct 15 15:26:48 node1048 pbs_mom: No such file or directory (2) in
open_std_file, cannot open/create stdout/stderr file
'/var/spool/torque/spool/860.nona-man.ER'
Oct 15 15:26:57 node1048 pbs_mom: sys_copy, command '/usr/bin/scp -rpB
/var/spool/torque/spool/860.nona-man.OU
mfitzpat@userB:/fs/userB1/mfitzpat/test3.o' failed with status=1, giving
up after 4 attempts
Oct 15 15:26:57 node1048 pbs_mom: req_cpyfile, Unable to copy file
/var/spool/torque/spool/860.nona-man.OU to
mfitzpat@userB:/fs/userB1/mfitzpat/test3.o
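
For reference, here is a minimal Python sketch of how the copy
destination in the log above (host userB, path
/fs/userB1/mfitzpat/test3.o) compares with the $usecp entries. It
assumes that pbs_mom matches the destination host against the host
part of each entry (wildcards allowed) and the destination path
against the remote path prefix, and falls back to scp when nothing
matches. The function name map_usecp is hypothetical, and this is an
approximation of the documented behaviour, not Torque's actual code.

```python
from fnmatch import fnmatch

# (host pattern, remote path prefix, local path prefix),
# copied from the mom_priv/config entries listed earlier.
USECP = [
    ("nona-man.spartans", "/spartans1", "/fs/spartans1"),
    ("nona-man.spartans", "/spartans2", "/fs/spartans2"),
    ("nona-man.spartans", "/spartans3", "/fs/spartans3"),
    ("userA.spartans", "/userA/u1", "/fs/userA1"),
    ("userA.spartans", "/userA/u2", "/fs/userA2"),
    ("userB.spartans", "/userB/u1", "/fs/userB1"),
    ("userB.spartans", "/userB/u2", "/fs/userB2"),
]


def map_usecp(host, path, entries=USECP):
    """Return the local path for a plain cp, or None if no entry matches.

    ASSUMPTION: the host must match the entry's host pattern and the
    path must start with the entry's remote prefix. None means the
    copy would fall back to scp.
    """
    for pattern, remote_prefix, local_prefix in entries:
        if fnmatch(host, pattern) and path.startswith(remote_prefix):
            return local_prefix + path[len(remote_prefix):]
    return None


# Destination exactly as it appears in the scp command in the log.
print(map_usecp("userB", "/fs/userB1/mfitzpat/test3.o"))
# Destination written the way the userB entries expect it.
print(map_usecp("userB.spartans", "/userB/u1/mfitzpat/test3.o"))
```

Under these assumptions the first call finds no match, because the
log uses the short host name userB and a path that already starts
with /fs/userB1, while the entries expect userB.spartans and
/userB/u1. That would be consistent with the mom falling back to scp.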
The head node and compute nodes can resolve both the short name and
the FQDN of all systems, but the user node (userB) can only resolve
the FQDN. Is that the issue?
userB:~$ host userB
Host userB not found: 3(NXDOMAIN)
userB:~$ host userB.spartans
userB.spartans has address 172.20.0.3
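
The host command queries DNS directly, whereas scp goes through the
system resolver, which may also consult /etc/hosts. A small Python
sketch for repeating the check through the system resolver on each
machine could look like this. The function name resolve_all is
hypothetical, and the host names are the ones from this post.

```python
import socket


def resolve_all(names, resolver=socket.gethostbyname):
    """Map each name to its IPv4 address, or None if it does not resolve."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            results[name] = None
    return results


# Example use on a node (performs real lookups when run):
#   for name, addr in resolve_all(["userB", "userB.spartans"]).items():
#       print(name, addr or "NOT RESOLVED")
```

Running it on the head node, a compute node, and userB would show
whether the short name fails everywhere or only on userB.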
What am I missing?
--
Thanks
Mary Ellen