Bug 8 - jobs disappear if MOM has no password entry for submit user
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 2.1.x
Platform: PC Linux
Importance: P5 major
Assigned To: Glen
Reported: 2009-06-18 04:31 MDT by ramon.bastiaans
Modified: 2009-08-27 20:12 MDT

Description ramon.bastiaans 2009-06-18 04:31:24 MDT
If a single node (MOM) in the cluster has broken user administration (for
example, a lost LDAP connection), all jobs scheduled on that node disappear.

This way you can lose all queued jobs to /dev/null if that particular node
happens to be the only free node. We call this a blackhole node.

When this occurs:

- pbs_mom only logs "No Password Entry for User somebody"
- after which the job disappears (without being executed)
- pbs_server logs no error
- the job is no longer present in the queue(s)

In my opinion a job should be requeued at the server when this occurs. Currently
the jobs just disappear, which I consider a bug. Under no circumstances should
jobs disappear from the queues without being run. Either put the job in a hold
state, mark the node offline, or do something similar.
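
To make the requested behaviour concrete, here is a minimal sketch in Python. It is not TORQUE code: the function names and the returned action strings are hypothetical, and the only real API used is Python's pwd.getpwnam, which raises KeyError when no password entry exists. The point is only that a failed user lookup leads to "keep the job, flag the node" rather than to a dropped job.

```python
import pwd


def user_is_known(username, lookup=pwd.getpwnam):
    """Return True if the password database resolves the user.

    `lookup` is injectable so the policy can be exercised without
    touching the real passwd/LDAP backend.
    """
    try:
        lookup(username)
        return True
    except KeyError:
        # pwd.getpwnam raises KeyError when there is no entry.
        return False


def action_for_job(username, lookup=pwd.getpwnam):
    """Hypothetical policy: a failed lookup must never drop the job."""
    if user_is_known(username, lookup):
        return "run"
    # Behaviour proposed in this report: keep the job at the server
    # and take the node out of service instead of losing the job.
    return "requeue_and_offline_node"
```

With the default lookup, action_for_job("somebody") consults the node's own password database, so a node with a broken LDAP connection would yield the requeue action for every affected user.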

The prologue is also not executed when this user error occurs, which makes it
impossible to check from the prologue whether the execution user exists.

The 'node health check script' is not sufficient to catch these user errors when
LDAP lookups are cached (use of nscd): you can never anticipate which user a job
will run as, or whether nscd has that user cached.
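
For illustration, a health check of the kind described above could look roughly like the Python sketch below. The probe account names and function names are hypothetical, and the sketch assumes the convention that a health check reports a problem by emitting a line starting with ERROR. It also shows the limitation: it can only probe accounts it knows about in advance, and a probe that passes because nscd still has the entry cached says nothing about the user of the next job.

```python
import pwd


def missing_users(usernames, lookup=pwd.getpwnam):
    """Return the subset of usernames that have no password entry."""
    missing = []
    for name in usernames:
        try:
            lookup(name)
        except KeyError:
            # No entry for this probe account on this node.
            missing.append(name)
    return missing


def health_report(usernames, lookup=pwd.getpwnam):
    """Return an ERROR line if any probe account is missing, else ''.

    The "ERROR ..." format is an assumption about what the consumer
    of the health check output expects.
    """
    missing = missing_users(usernames, lookup)
    if missing:
        return "ERROR no password entry for: " + ", ".join(missing)
    return ""
```

A wrapper script would print health_report([...]) for a fixed list of probe accounts; it cannot cover every user who might submit a job.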

We run TORQUE v2.1.11.

Comment 1 Glen 2009-08-27 20:12:37 MDT
Thanks for the bug report, and sorry it took so long for someone to respond.
I'm going to try to take a look at this or find another developer who can.