[torqueusers] Torque : pbs_mom stuck with "no password entry for user <someuser>" message
knielson at adaptivecomputing.com
Mon Dec 6 12:13:11 MST 2010
On 12/06/2010 08:17 AM, Henri Marsalet wrote:
> I run a 256-node PBS cluster using the latest Torque 2.5.3, under Linux Fedora
> Core 6 with a 2.6.22-9 kernel. Users are authenticated by an LDAPS server with the
> native pam_ldap module.
> Most of the time the system works flawlessly, but sometimes, seemingly at
> random, ALL nodes stop accepting jobs from the PBS server. Each time a job
> is submitted, the following error appears in the node's syslog:
> pbs_mom: LOG_ERROR::start_exec, no password entry for user <name of the user>
> There is no authentication problem on the nodes, though: the "getent passwd
> <someuser>" command returns a correct value, and I can also log on to any node
> under the specified user ID.
> At this point, the only way to get the nodes working again seems to be to kill and
> restart all the pbs_mom processes. I have already tried restarting the Autofs and NSCD
> daemons, with no success.
> I have been fighting this issue for two weeks now. I have seen that I am not the only one
> with this problem, but there is no fix so far. Maybe the culprit is a bug in
> the src/resmom/start_exec.c source file...
> Does anybody have a clue about this?
It appears you have already looked at the code. The problem is that
getpwnam has returned NULL for the given user. Can you give us the user
name that is failing?
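
To help narrow this down, here is a minimal sketch of a standalone check built
on the same getpwnam call. It is not taken from the Torque source, and the
helper name check_passwd_entry is hypothetical. It clears errno before the call
so that a NULL result can be split into "no matching entry" and "the lookup
itself failed". Which errno values appear on failure depends on the C library
and the NSS backend, so treat the distinction as a hint only.

```c
#include <errno.h>
#include <pwd.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not part of Torque.
 * Returns 0 if the user has a passwd entry, 1 if getpwnam returned NULL
 * without setting errno, and 2 if getpwnam returned NULL and set errno.
 */
int check_passwd_entry(const char *user)
{
    struct passwd *pw;

    errno = 0;                      /* must be cleared before the call */
    pw = getpwnam(user);

    if (pw != NULL) {
        printf("%s: uid=%ld gid=%ld home=%s\n",
               user, (long)pw->pw_uid, (long)pw->pw_gid, pw->pw_dir);
        return 0;
    }

    if (errno == 0) {
        /* The name service answered, but had no entry for this name. */
        fprintf(stderr, "%s: no matching passwd entry\n", user);
        return 1;
    }

    /* The lookup failed; errno may point at the reason. */
    fprintf(stderr, "%s: lookup failed: %s\n", user, strerror(errno));
    return 2;
}
```

Calling check_passwd_entry with the failing user name from a small test program
on an affected node shows what a fresh process gets back. Since getent already
returns a correct value there, the fresh process may well succeed too, which
would suggest the failure is tied to the state of the running pbs_mom rather
than to the user's entry.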