[torqueusers] OS X Leopard, torque, authentication problem
jbernstein at penguincomputing.com
Wed Jan 7 14:37:11 MST 2009
Tapio Simula wrote:
> Joshua Bernstein wrote:
>>> I am trying to set up a queue on a Mac Pro cluster running OSX
>>> Leopard 10.5.6 (the following problem existed on an earlier
>>> version of Leopard too). Testing only on a single node
>>> everything seems to work fine. When I use two nodes, one running
>>> pbs_server (and scheduler) and the other pbs_mom, all still
>>> works as long as I stay logged on in the node running the
>>> pbs_mom. As soon as I log out, the file staging (scp copy) fails.
>> When the file staging fails, do you see an error anywhere? What is
>> the error? If you are not logged into the node, does the job even
>> get submitted to pbs_mom or just rejected outright due to an
>> authentication error?
> The error (from the returned email) is essentially due to scp failing:
> >>> error from copy
> unknown user 504
> >>> end error output
> If I do not stage-in anything then the job runs fine (irrespective of
> whether the authentication works properly or not) but the output stays
> undelivered on the momhost. If I try to stage-in any files then the
> failure occurs straight away. The above error is the same I get on
> command line if I try to scp a file from an interactive job (in the case
> of faulty authentication). Also, it is not enough if I am just logged
> into the node running the mom, the pbs_mom must have been started during
> the same login session. If I log out and immediately back again then the
> authentication is broken again.
Interesting. How are user accounts managed? Is it possible that the MOM
node is using Open Directory or some other federated naming server, even
something like NIS? The 504 in "unknown user 504" is the numeric user
ID (UID). I can't think of a case where OS X would dynamically create
UIDs when you log in. I assume the UID is the same every time (504?).
Perhaps you need to make sure that your UID on the main system is the
same as it is on the remote system.
On Linux you can use the getent command to check this. I'm not on my Mac
at the moment, but it may still work there:
[ats@goldstar NebraLonLatP180]$ getent passwd ats
Here my UID and GID are both 5164 for user ats. Perhaps it's important
that this number be the same, for scp's sake.
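The UID comparison described above could be scripted roughly as follows.
This is only a sketch under assumptions: the function names
uid_from_passwd_line and uids_match are made up for illustration,
"username" and "momhost" are the placeholders used in this thread, and
the commented usage relies on `id -u`, which exists on both Linux and
OS X, in case getent is not present on the Mac.

```shell
#!/bin/sh
# Sketch: compare a user's numeric UID on two hosts.
# Function names are hypothetical; adapt host and user names to your setup.

# Print the UID (third colon-separated field) of a passwd-format line,
# such as one returned by "getent passwd <user>".
uid_from_passwd_line() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Report whether two UIDs match. Returns 0 on a match, 1 otherwise.
# An empty UID (lookup failed) is treated as a mismatch.
uids_match() {
    if [ -n "$1" ] && [ "$1" = "$2" ]; then
        echo "match: UID $1"
    else
        echo "mismatch: local='$1' remote='$2'"
        return 1
    fi
}

# Example usage (not run here; needs working ssh to the remote node):
#   local_uid=$(id -u username)
#   remote_uid=$(ssh momhost id -u username)
#   uids_match "$local_uid" "$remote_uid"
```

If the two numbers differ, or the remote lookup returns nothing, that
would be consistent with the "unknown user 504" message, though it does
not prove that this is the cause.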
> errors from syslog:
> Wed Jan 7 11:15:55 momhost pbs_mom <Error>: sys_copy, command
> '/usr/bin/scp -rpB /var/spool/torque/spool/6.serverhost.OU
> username@serverhost:/Users/username/tst.o6' failed with status=255,
> giving up after 4 attempts
I imagine that there is something more realistic for "username" and
"serverhost" but I just thought I'd ask.
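It may also help to check, outside of Torque, whether a login without
prompts works at all from the MOM node. The following is a rough
sketch, assuming key-based ssh between the two hosts; the function name
can_login_noninteractively and the SSH_CMD variable are made up for
illustration, and username@serverhost is the placeholder from the log.
The ssh option BatchMode=yes disables password prompts, much like the
-B flag in the scp command above.

```shell
#!/bin/sh
# Sketch: test whether ssh to a target works without any prompting.
# SSH_CMD is a hypothetical override so the ssh binary can be swapped out.
SSH_CMD=${SSH_CMD:-ssh}

# $1 is the target in user@host form.
# Returns 0 if the remote "true" command ran, 1 otherwise.
can_login_noninteractively() {
    if "$SSH_CMD" -o BatchMode=yes "$1" true; then
        echo "ok: $1 accepts non-interactive logins"
    else
        echo "failed: non-interactive login to $1 did not succeed"
        return 1
    fi
}

# Example usage (not run here), as the job owner on the MOM node:
#   can_login_noninteractively username@serverhost
```

A failure here only says that the login did not succeed; the ssh error
message printed alongside it is what would narrow down the reason.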