[torqueusers] OS X Leopard, torque, authentication problem

Glen Beane glen.beane at gmail.com
Tue Jan 6 09:31:10 MST 2009

On Mon, Jan 5, 2009 at 9:34 PM, Tapio Simula <tapio.simula at gmail.com> wrote:
> I am trying to set up a queue on a Mac Pro cluster running OSX Leopard
> 10.5.6 (the following problem existed on an earlier version of Leopard too).
> Testing only on a single node everything seems to work fine. When I use two
> nodes, one running pbs_server (and scheduler) and the other pbs_mom, all
> still works as long as I stay logged on in the node running the pbs_mom. As
> soon as I log out, the file staging (scp copy) fails. Restarting pbs_mom
> fixes the issue but again for only as long as I stay logged on in the node
> running the mom. Below is an example of an interactive job (when I have
> logged out from the momhost) which may point to an authentication issue?
> The interactive job starts, runs and exits but the user name matching the
> uid cannot be read from the (local) database for some reason (id returns the
> correct uid but no user name) and this is also the reason that the file
> staging fails for normal jobs.
> -----------------------------------------------
> serverhost:~ myusername$ qsub -I
> qsub: waiting for job 30.mydomain to start
> qsub: job 30.mydomain ready
> momhost:~ I have no name!$ dscl . -read /Users
> Operation failed with error: eServerNotRunning
> momhost:~ I have no name!$ dscacheutil -flushcache
> Flushcache failed, unable to talk to daemon
> momhost:~ I have no name!$ exit
> logout
> qsub: job 30.mydomain completed
> serverhost:~ myusername$
> -----------------------------------------------
> ssh/scp works fine both ways without prompting passwords. All of the above
> is also independent of the scheduler (using Maui or pbs_sched yield the same
> results). I am currently testing torque-2.4.0b1 but have the same issue with
> various earlier versions of torque.
> Any help on how to get torque working with Leopard would be much
> appreciated.
> Kind regards,
> Tapio

TORQUE became flaky with Leopard, and I would love to get Leopard
properly supported.  You won't find any better Leopard support in
2.4.0b than you will in the 2.3 branch (2.4.0 has not been officially
released yet).

Right now I see two problems on Leopard which do not exist on Tiger:

1) the authentication problem you see.  I had no idea what was causing
this, but from what you have posted it seems like perhaps the
DirectoryService daemon is getting started on demand at login time,
so it is no longer running once you log out.

2) a problem somewhere inside the code enabled by the HAVE_WORDEXP
macro, which configure sets if the wordexp function exists on the
system.  This code works on other operating systems with wordexp,
including Tiger, but on Leopard it crashes the child process of
pbs_mom that copies the stderr and stdout files back to the user, and
the job gets stuck in the "E" state.

A workaround for the wordexp problem: after running configure, open
src/include/pbs_config.h, search for HAVE_WORDEXP, and comment out
the line #define HAVE_WORDEXP 1

With HAVE_WORDEXP disabled, you can't use environment variables in
your stderr and stdout paths.  I haven't had a lot of time to look
into this, but I have started writing some test code to try to
reproduce the crash in a very short test program, so I can isolate the
problem and possibly get a bug report to Apple.  Since this works on
every other OS I think it is probably a bug in OS X, but I am not
certain at this point.

I would like to fix both of these issues.  In the Panther and then
Tiger days I had a 512 processor OS X cluster, so I was very active in
maintaining TORQUE for OS X.  Justin Bronder and I fixed bugs on OS X,
and when many things in TORQUE broke with the release of Tiger, we
were the ones who got it working again.  I no longer have an OS X
cluster, which makes it hard for me to really understand all the
issues.
