[torqueusers] Torque : pbs_mom stuck with "no password entry for user <someuser>" message
pascal.mayani at eurecom.fr
Tue Dec 7 14:21:03 MST 2010
Henri Marsalet <henri.marsalet <at> yahoo.fr> writes:
> I run a 256 nodes PBS cluster using the latest Torque 2.5.3, under Linux
> Fedora Core 6 with 2.6.22-9 kernel. Users are authenticated by a LDAPS server
> with the native pam_ldap module.
> Most of the time the system is working flawlessly, but sometimes, on a
> perfectly random basis, ALL nodes stop accepting jobs from the PBS server.
> Each time a job is submitted, the following error pops up in the node's
> syslog :
> pbs_mom: LOG_ERROR::start_exec, no password entry for user <name of the user>
I had exactly the same problem on my PBS cluster.
It turned out the client-side packages were compiled in 32-bit mode on the
master node, and then distributed and installed to the compute nodes which run
on a 64-bit version of Linux... Calling some functions in such a case, including
getpwnam(), can sometimes result in weird behaviours.
This could be a fairly common mistake, because the master node has no need for
large memory and consequently often runs on a 32-bit platform.
You can see the MOM's library dependencies with the ldd command. On the 32-bit :
linux-gate.so.1 => (0xffffe000)
libutil.so.1 => /lib/libutil.so.1 (0x471ad000)
libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0xf7edb000)
libpthread.so.0 => /lib/libpthread.so.0 (0x471dc000)
libc.so.6 => /lib/libc.so.6 (0x4706e000)
And on the 64-bit :
linux-vdso.so.1 => (0x00007fff689fd000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003eaa200000)
libtorque.so.2 => /usr/local/lib64/libtorque.so.2 (0x00002ad9421e0000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e9c800000)
libc.so.6 => /lib64/libc.so.6 (0x0000003e9bc00000)
So have a look at this first. I guess your test program is irrelevant if it's
linked to the local 64-bit libraries whereas the pbs_mom is still linked to the
32-bit ones.
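
If ldd is not conclusive, the word size of a binary can also be read straight
from its ELF header. The following is only a minimal sketch, not part of Torque:
the function name elf_class is made up for illustration, and it assumes the
files are ordinary ELF binaries or shared objects, as they are on Linux. It
reads the fifth byte of the file (EI_CLASS), which is 1 for 32-bit and 2 for
64-bit objects.

```python
import sys

ELF_MAGIC = b"\x7fELF"
# EI_CLASS values defined by the ELF format: 1 = ELFCLASS32, 2 = ELFCLASS64
EI_CLASS_NAMES = {1: "32-bit", 2: "64-bit"}


def elf_class(path):
    """Return '32-bit' or '64-bit' for an ELF file, or None if it is not ELF.

    elf_class is a hypothetical helper name, not something shipped with Torque.
    """
    with open(path, "rb") as f:
        ident = f.read(5)  # 4 magic bytes followed by the EI_CLASS byte
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None
    return EI_CLASS_NAMES.get(ident[4], "unknown ELF class")


if __name__ == "__main__":
    # Pass the pbs_mom binary and the library paths reported by ldd.
    for path in sys.argv[1:]:
        print(path, elf_class(path) or "not an ELF file")
```

Run it on a compute node against the pbs_mom binary and the libtorque.so.2
path that ldd reports there. If it prints "32-bit" on a node running 64-bit
Linux, the packages were most likely built on the 32-bit master.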