Bugzilla – Bug 8
jobs disappear if MOM has no password entry for submit user
Last modified: 2009-08-27 20:12:37 MDT
If one node/MOM in a cluster has (by accident) broken user administration, for example a lost LDAP connection, all jobs scheduled there disappear. This way you can lose all queued jobs to /dev/null if that particular node happens to be the only free node. We call this a blackhole node.

When this occurs:
- pbs_mom only logs "No Password Entry for User somebody"
- the job then disappears without being executed
- pbs_server logs no error
- the job is no longer present in the queue(s)

In my opinion a job should be requeued at the server when this occurs. Now the jobs just disappear. I consider this a bug. In no situation should jobs disappear from the queues without being run. Either put the job in hold state, take the node offline, or do something similar.

The prologue is also not executed when this user error occurs, which makes it impossible to check from the prologue whether the execution user exists. The 'node health check script' is not sufficient to prevent user errors when LDAP lookups are cached (use of nscd): you can never anticipate which user the job might run under, or whether that user is cached by nscd.

We run torque v2.1.11
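To make the requested behavior concrete, here is a minimal Python sketch of the decision I would expect when the password lookup fails. It is not Torque code: the function name `dispatch_decision` and the "run"/"requeue" return values are hypothetical, and the real logic would live in pbs_mom's C code. The only real call used is Python's `pwd.getpwnam`, which raises `KeyError` when no entry is found.

```python
import pwd


def dispatch_decision(username, lookup=pwd.getpwnam):
    """Decide what to do with a job based on the password lookup.

    Hypothetical illustration only: returns "run" if the execution
    user has a password entry, and "requeue" if it does not, so the
    job goes back to the server instead of being dropped.
    """
    try:
        lookup(username)  # pwd.getpwnam raises KeyError if the user is unknown
    except KeyError:
        # Lookup failed (e.g. broken LDAP): never discard the job.
        return "requeue"
    return "run"


if __name__ == "__main__":
    # "somebody" is the placeholder user name from the pbs_mom log line.
    print(dispatch_decision("somebody"))
```

The `lookup` parameter is only there so the failure path can be exercised without actually breaking LDAP on a node.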
Thanks for the bug report, and sorry it took so long for someone to respond. I'm going to try to take a look at this, or see if I can find another developer to take a look.