[torqueusers] output files not being delivered

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Wed Oct 15 13:37:05 MDT 2008

After reading through the threads, I thought I had everything set up 
correctly for having my output files delivered to my home dir, but 
apparently not.  I am running torque-2.3.2-snap.200807231134 and 
maui-3.2.6p19, with xcat1.3.0.  Torque was compiled --with-scp.  
Permissions for /var/spool/torque/spool on both the head and compute 
nodes are:   drwxrwxrwt 2 root root 4096 Oct 15 15:12 spool
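
A minimal sketch for checking the spool directory mode on a node, in case 
it helps compare nodes quickly.  The helper name spool_mode is 
hypothetical (not part of Torque); it only prints the first ten 
characters of the `ls -ld` output, which should read drwxrwxrwt for a 
world-writable directory with the sticky bit set.

```shell
#!/bin/sh
# spool_mode DIR
# Hypothetical helper: print the type and permission string of DIR,
# e.g. "drwxrwxrwt". Only columns 1-10 are kept so that a trailing
# ACL or SELinux marker does not affect the comparison.
spool_mode() {
    ls -ld "$1" | cut -c1-10
}

# Example use on a node (path taken from the message above):
#   [ "$(spool_mode /var/spool/torque/spool)" = "drwxrwxrwt" ] && echo "spool mode ok"
```

Running the same check on the head node and on each compute node makes 
it easy to spot a node whose spool directory differs.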

My home dirs are NFS mounted across all nodes.  /opt/xcat/etc/gkh has 
all of the ssh keys for each node.  The /root/.ssh dir on each node 
contains a config file that points to the ssh keys, although I am not 
sure I need the ssh keys if I am copying to the NFS-mounted dirs 
($usecp).  The /var/spool/torque/mom_priv/config file has the following 
entries (cluster domain name is spartans).  userB is the user node I am 
working with.

Node: /var/spool/torque/mom_priv/config file
$logevent 0x1ff
$clienthost  nona-man
$tmpdir /scr
$usecp nona-man.spartans:/spartans1 /fs/spartans1
$usecp nona-man.spartans:/spartans2 /fs/spartans2
$usecp nona-man.spartans:/spartans3 /fs/spartans3
$usecp userA.spartans:/userA/u1 /fs/userA1
$usecp userA.spartans:/userA/u2 /fs/userA2
$usecp userB.spartans:/userB/u1 /fs/userB1
$usecp userB.spartans:/userB/u2 /fs/userB2

This is a new cluster, so it is the first time these nodes are 
requesting the NFS-mounted dirs.  When I submit a job to a node, it 
will not cd to the NFS-mounted dir, although when I log into the node I 
can cd to the desired dir without issue.

Oct 15 15:26:48 node1048 pbs_mom: No such file or directory (2) in 
TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat' failed: No such 
file or directory 
Oct 15 15:26:48 node1048 pbs_mom: No such file or directory (2) in 
open_std_file, cannot open/create stdout/stderr file 
Oct 15 15:26:57 node1048 pbs_mom: sys_copy, command '/usr/bin/scp -rpB 
mfitzpat@userB:/fs/userB1/mfitzpat/test3.o' failed with status=1, giving 
up after 4 attempts
Oct 15 15:26:57 node1048 pbs_mom: req_cpyfile, Unable to copy file 
/var/spool/torque/spool/860.nona-man.OU to 
mfitzpat@userB:/fs/userB1/mfitzpat/test3.o
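
A rough sketch of how a $usecp entry appears to be applied, for 
comparing the config entries with the destination in the log.  It 
assumes pbs_mom matches the destination host against the entry's host 
pattern and the destination path against the entry's path prefix, then 
swaps in the local prefix and uses cp instead of scp.  The helper 
usecp_map is hypothetical and is not part of Torque; it only imitates 
that assumed matching.

```shell
#!/bin/sh
# usecp_map DEST HOSTPAT FROM TO
# Hypothetical helper imitating the assumed $usecp matching:
#   DEST    destination as [user@]host:path
#   HOSTPAT host part of a $usecp entry (shell wildcards allowed)
#   FROM    path prefix after the colon in the $usecp entry
#   TO      local path prefix (second field of the $usecp entry)
# Prints the local path and returns 0 on a match, returns 1 otherwise.
usecp_map() {
    dest=$1 hostpat=$2 from=$3 to=$4
    hostpart=${dest%%:*}     # "user@host" or "host"
    host=${hostpart##*@}     # strip an optional "user@"
    path=${dest#*:}          # everything after the first colon
    case $host in
        $hostpat) ;;         # unquoted on purpose so wildcards work
        *) return 1 ;;
    esac
    case $path in
        "$from"*) printf '%s%s\n' "$to" "${path#"$from"}" ;;
        *) return 1 ;;
    esac
}

# Destination as it appears in the log, against the userB1 entry:
usecp_map 'mfitzpat@userB:/fs/userB1/mfitzpat/test3.o' \
    'userB.spartans' '/userB/u1' '/fs/userB1' || echo "no match"

# A destination that does line up with the same entry:
usecp_map 'mfitzpat@userB.spartans:/userB/u1/mfitzpat/test3.o' \
    'userB.spartans' '/userB/u1' '/fs/userB1'
# prints /fs/userB1/mfitzpat/test3.o
```

Under these assumptions, both the host name and the path prefix in the 
entry have to line up with what the job reports as its output 
destination; otherwise the copy would fall through to scp.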

The head node and compute nodes can resolve both the short name and the 
FQDN of all systems, but the user node (userB) can only resolve the 
FQDN.  Is that the issue?
userB:~$ host userB
Host userB not found: 3(NXDOMAIN)

userB:~$ host userB.spartans
userB.spartans has address
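
A small sketch for repeating the same resolution check on several nodes. 
The helper name check_resolve is hypothetical.  It uses getent, which 
goes through the system resolver (so /etc/hosts entries count as well), 
whereas host queries DNS only; the two can therefore disagree.

```shell
#!/bin/sh
# check_resolve NAME...
# Hypothetical helper: report whether each NAME resolves through the
# system resolver (NSS), i.e. /etc/hosts, DNS, and whatever else
# nsswitch.conf lists for "hosts".
check_resolve() {
    for name in "$@"; do
        if getent hosts "$name" >/dev/null 2>&1; then
            echo "$name: resolves"
        else
            echo "$name: NOT resolvable"
        fi
    done
}

# Example use, run on the head node, a compute node, and userB:
#   check_resolve userB userB.spartans
```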

What am I missing?

Mary Ellen

