[torqueusers] Sporadic UID errors

Phil Regier pregier at ittc.ku.edu
Mon Jun 25 09:36:07 MDT 2012

Nice; that's pretty slick!  I'm sure that will solve the problem; I'll switch back to 3.0.5 in a bit to try it out.



----- Original Message -----
From: "David Beer" <dbeer at adaptivecomputing.com>
To: "Torque Users Mailing List" <torqueusers at supercluster.org>
Sent: Monday, June 25, 2012 10:01:29 AM
Subject: Re: [torqueusers] Sporadic UID errors


We have had other customers and users who hit this problem because LDAP lookups sometimes fail. We added a retry parameter for the moms; you can set it in the mom's config file by adding the line:

$ext_pwd_retry <num retries> 

If you don't expect jobs to land on machines where the user genuinely has no account, then you might want to set this to a fairly high number so that jobs aren't lost unnecessarily.
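
To illustrate the retry idea in general terms, here is a minimal Python sketch. It is not Torque's actual code; the helper name `lookup_user_with_retry` and its `retries`, `delay`, and `lookup` parameters are hypothetical. It assumes a failed user lookup may be transient (for example, an LDAP hiccup) and simply tries again a bounded number of times before giving up.

```python
import pwd
import time


def lookup_user_with_retry(name, retries=3, delay=0.1, lookup=pwd.getpwnam):
    """Return the password entry for `name`, retrying transient misses.

    Hypothetical helper: makes one initial attempt plus up to `retries`
    further attempts, sleeping `delay` seconds between attempts.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return lookup(name)
        except KeyError as err:
            # pwd.getpwnam raises KeyError when no entry is found; with a
            # flaky directory service that may only be a temporary miss.
            last_error = err
            if attempt < retries:
                time.sleep(delay)
    # Every attempt failed, so treat the user as really missing.
    raise last_error
```

A larger `retries` value trades a slower failure for a genuinely unknown user against fewer jobs lost to temporary lookup failures, which is the same trade-off described above for `$ext_pwd_retry`.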


On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier < pregier at ittc.ku.edu > wrote: 

Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying 4.0.3 snapshot now), and it should also be noted that as part of the stress test I am constantly watching repeated qstats. The problem does not seem to appear with 4.0.x as such; might this be related to the switch from a single-threaded server to multi-threaded? 

----- Original Message ----- 
From: "Phil Regier" < pregier at ittc.ku.edu > 
To: torqueusers at supercluster.org 
Sent: Friday, June 22, 2012 2:14:12 PM 
Subject: Sporadic UID errors 

Sorry if this has been raised before (there is another LDAP thread active, but I think the problem is very different); I'm still going through the archives.

I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible upgrade from 2.x and have come across some odd behaviors. In particular, when I submit 1000 small jobs to a fake one-node cluster running Torque 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can retrieve specfiles etc. if that would help) and authenticated against LDAP, I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never get accepted); for example: 

qsub: Bad UID for job execution MSG=User pregier does not exist in server password file 


This is just a loop; there is no difference between job 14291, 14293, and what should have been 14292. 
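
A submission loop of this kind could be sketched roughly as follows. This is only an illustration, not the actual test script: the function names and the trivial `sleep 1` job are assumptions, and it relies on `qsub` reading the job script from standard input when no script file is given.

```python
import subprocess


def submit_with_qsub(script="sleep 1\n"):
    """Submit one trivial job by piping the script to qsub on stdin.

    Returns (ok, message): the job ID on success, the error text on failure.
    """
    result = subprocess.run(["qsub"], input=script, text=True, capture_output=True)
    return result.returncode == 0, (result.stdout or result.stderr).strip()


def stress_submit(count, submit=submit_with_qsub):
    """Submit `count` identical jobs and collect the ones that were rejected."""
    failures = []
    for i in range(count):
        ok, message = submit()
        if not ok:
            # Record which iteration failed and what qsub reported.
            failures.append((i, message))
    return failures
```

Calling `stress_submit(1000)` and printing the returned list would show which iterations were rejected and with what message, making it easy to confirm that the failing submissions are identical to their neighbours.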

Is this normal? Are there precautions to avoid it, or is this a bug I should be reporting in more detail? 

Thanks for any suggestions; I'm not terribly experienced with Torque, so I'm not sure how quickly I should be bringing this sort of thing to the list. I can provide more details about my setup and/or stress tests, but didn't want to dump too much useless information in my first post. 

Phil Regier 
Student assistant system administrator
University of Kansas, ITTC 
torqueusers mailing list 
torqueusers at supercluster.org 


David Beer | Software Engineer 
Adaptive Computing 
