[torqueusers] Torque : pbs_mom stuck with "no password entry for user <someuser>" message

Mon Dec 6 08:17:53 MST 2010


I run a 256-node PBS cluster using Torque 2.5.3 (the latest release) under
Fedora Core 6 Linux with a 2.6.22-9 kernel. Users are authenticated against an
LDAPS server through the native pam_ldap module.

Most of the time the system works flawlessly, but sometimes, at seemingly
random moments, ALL nodes stop accepting jobs from the PBS server. Each time a
job is submitted, the following error shows up in the node's syslog:

pbs_mom: LOG_ERROR::start_exec, no password entry for user <name of the user>

There is no authentication problem on the nodes, though: the "getent passwd
<someuser>" command returns a correct entry, and I can also log in to any node
as that user.
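
For anyone who wants to reproduce the check outside of getent, here is a
minimal sketch of the same lookup in Python. It assumes pbs_mom resolves users
through the standard getpwnam() call (I have not confirmed this in the Torque
source), and the helper name lookup_user is only illustrative.

```python
import pwd


def lookup_user(name, getpwnam=pwd.getpwnam):
    """Return (uid, gid, home) for `name`, or None if there is no passwd entry.

    `getpwnam` is injectable only so the function can be exercised without
    a real account; by default it goes through NSS like getent does.
    """
    try:
        entry = getpwnam(name)
    except KeyError:
        # pwd.getpwnam raises KeyError when no entry is found.
        return None
    return entry.pw_uid, entry.pw_gid, entry.pw_dir
```

A fresh process running this succeeds on the affected nodes in the same way
getent does, which is why the failure seems tied to the already running
pbs_mom rather than to the LDAP setup itself.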

At this point, the only way to get the nodes working again seems to be to kill
and restart all the pbs_mom processes. I have tried restarting the autofs and
nscd daemons first, with no success.
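
Since only a restart of the long-running daemon helps, one way to narrow this
down might be a long-lived probe that repeats the passwd lookup from a single
process and records when it starts failing. The sketch below is only an idea
under that assumption; the function name probe and its parameters are
hypothetical, and it has not been run against the failing cluster.

```python
import pwd
import time


def probe(name, rounds, interval=0.0, getpwnam=pwd.getpwnam, sleep=time.sleep):
    """Look up `name` `rounds` times from one process.

    Returns the list of round indices (0-based) where the lookup failed.
    `getpwnam` and `sleep` are injectable so the loop can be tested quickly.
    """
    failures = []
    for i in range(rounds):
        try:
            getpwnam(name)
        except KeyError:
            # No passwd entry was returned for this round.
            failures.append(i)
        if i + 1 < rounds:
            sleep(interval)
    return failures
```

Left running on a node with a long interval, a probe like this would show
whether any long-lived process loses the ability to resolve users at the same
moment pbs_mom does, or whether the problem is specific to pbs_mom.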

I have been fighting this issue for two weeks now. I have seen that I am not
the only one with this problem, but there is no fix so far. Maybe the culprit
is a bug in the src/resmom/start_exec.c source file...

Does anybody have a clue about this?



