Tapio Simula tapio.simula at gmail.com
Tue Jan 6 19:51:34 MST 2009

Joshua Bernstein wrote:

>> I am trying to set up a queue on a Mac Pro cluster running OSX Leopard
>> 10.5.6 (the following problem existed on an earlier version of Leopard too).
>> Testing only on a single node everything seems to work fine. When I use two
>> nodes, one running pbs_server (and scheduler) and the other pbs_mom, all
>> still works as long as I stay logged on in the node running the pbs_mom. As
>> soon as I log out, the file staging (scp copy) fails.
> When the file staging fails, do you see an error anywhere? What is the
> error? If you are not logged into the node, does the job even get submitted
> to pbs_mom or just rejected outright due to an authentication error?
The error (from the returned email) is essentially due to scp failing:
>>> error from copy
unknown user 504
>>> end error output

If I do not stage in anything, the job runs fine (irrespective of whether
the authentication works properly) but the output stays undelivered on the
momhost. If I try to stage in any files, the failure occurs straight away.
The above error is the same one I get on the command line if I try to scp a
file from an interactive job (in the case of faulty authentication). Also,
it is not enough for me to be logged into the node running the mom; pbs_mom
must have been started during the same login session. If I log out and
immediately log back in, the authentication is broken again.
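
A minimal diagnostic sketch, in case it helps narrow this down. As far as I
can tell, scp prints "unknown user <uid>" when it cannot look up the password
database entry for the uid it is running as, so the check below only tests
whether that lookup succeeds for the current process. The helper name
lookup_uid is made up for illustration, and the idea that the lookup is what
breaks after logout is an assumption, not something confirmed here.

```python
import os
import pwd


def lookup_uid(uid):
    """Return the login name for uid, or None if the lookup fails."""
    try:
        # pwd.getpwuid raises KeyError when no entry exists for the uid.
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


if __name__ == "__main__":
    uid = os.getuid()
    name = lookup_uid(uid)
    if name is None:
        # Mirrors the wording of the error seen in the returned email.
        print("unknown user %d" % uid)
    else:
        print("uid %d resolves to %s" % (uid, name))
```

Running this once from a login shell on the momhost and once from inside a
job submitted after logging out should show whether the uid lookup itself
differs between the two cases.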

Errors from syslog:
Wed Jan  7 11:15:55 momhost pbs_mom[50120] <Error>: sys_copy, command
'/usr/bin/scp -rpB /var/spool/torque/spool/6.serverhost.OU
username@serverhost:/Users/username/tst.o6' failed with status=255, giving
up after 4 attempts
Wed Jan  7 11:15:55 momhost pbs_mom[50120] <Error>: req_cpyfile, Unable to
copy file /var/spool/torque/spool/6.serverhost.OU to username@serverhost
