[torqueusers] Sporadic UID errors

David Beer dbeer at adaptivecomputing.com
Mon Jun 25 09:01:29 MDT 2012


Phil,

We have had other customers and users hit this problem because LDAP lookups
occasionally fail. We added a retry parameter for the moms. You can set it
in the mom's config file by adding the line:

$ext_pwd_retry <num retries>

If you don't really have users going to machines that they shouldn't go to,
then you might want to set this to a fairly high number so that jobs aren't
lost unnecessarily.
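
To illustrate the trade-off, here is a minimal Python sketch of retrying a
user lookup when the directory service is flaky. This is not TORQUE's code;
the names lookup_user, retries, and delay are hypothetical, and it only
assumes that a failed lookup shows up as "no such user" (KeyError from
pwd.getpwnam), which may not match how the mom does it internally.

```python
import pwd
import time


def lookup_user(name, retries=3, delay=0.5, lookup=pwd.getpwnam):
    """Look up a user, retrying when the lookup reports "no such user".

    retries is the number of extra attempts after the first one (an
    assumption for this sketch). lookup is injectable so the behaviour
    can be exercised without a real LDAP server.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return lookup(name)
        except KeyError:
            # pwd.getpwnam raises KeyError when no entry is found; a
            # transient directory failure can look the same as a user
            # that really does not exist.
            if attempt == attempts:
                raise
            time.sleep(delay)
```

A user who genuinely does not exist only gets rejected after every attempt
has failed, which is why a high retry count makes sense mainly when invalid
users are not a real concern.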

David

On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier <pregier at ittc.ku.edu> wrote:

> Oops.  An error and an omission:  I meant 4.0.2 instead of 4.0.4 (trying
> 4.0.3 snapshot now), and it should also be noted that as part of the stress
> test I am constantly watching repeated qstats.  The problem does not seem
> to appear with 4.0.x as such; might this be related to the switch from a
> single-threaded server to multi-threaded?
>
> ----- Original Message -----
> From: "Phil Regier" <pregier at ittc.ku.edu>
> To: torqueusers at supercluster.org
> Sent: Friday, June 22, 2012 2:14:12 PM
> Subject: Sporadic UID errors
>
> Sorry if this has been raised before (there is another LDAP thread active,
> but I think the problem is very different); I'm still going through the
> archives.
>
> I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible
> upgrade from 2.x and have come across some odd behaviors.  In particular,
> when I submit 1000 small jobs to a fake one-node cluster running Torque
> 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can
> retrieve specfiles etc. if that would help) and authenticated against LDAP,
> I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never
> get accepted); for example:
>
> ...
> 14289.localhost
> 14290.localhost
> 14291.localhost
> qsub: Bad UID for job execution MSG=User pregier does not exist in server
> password file
>
> 14293.localhost
> 14294.localhost
> 14295.localhost
> ...
>
>
> This is just a loop; there is no difference between job 14291, 14293, and
> what should have been 14292.
>
> Is this normal?  Are there precautions to avoid it, or is this a bug I
> should be reporting in more detail?
>
> Thanks for any suggestions; I'm not terribly experienced with Torque, so
> I'm not sure how quickly I should be bringing this sort of thing to the
> list.  I can provide more details about my setup and/or stress tests, but
> didn't want to dump too much useless information in my first post.
>
> Phil Regier
> Student assistant system administrator
> University of Kansas, ITTC
>



-- 
David Beer | Software Engineer
Adaptive Computing