[torqueusers] Sporadic UID errors

Phil Regier pregier at ittc.ku.edu
Mon Jun 25 09:36:07 MDT 2012


Nice; that's pretty slick!  I'm sure that will solve the problem; I'll switch back to 3.0.5 in a bit to try it out.

Thanks!

Phil

----- Original Message -----
From: "David Beer" <dbeer at adaptivecomputing.com>
To: "Torque Users Mailing List" <torqueusers at supercluster.org>
Sent: Monday, June 25, 2012 10:01:29 AM
Subject: Re: [torqueusers] Sporadic UID errors


Phil, 

We have had other customers/users who hit this problem because LDAP lookups fail intermittently. We added a retry parameter for the moms. You can set it in the mom's config file; just add the line: 

$ext_pwd_retry <num retries> 

If you don't really have users going to machines that they shouldn't go to, then you might want to set this to a fairly high number so that jobs aren't lost unnecessarily. 
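
To illustrate what a retry setting of this kind guards against, here is a minimal Python sketch of retrying a password-database lookup that fails intermittently. It is not Torque's implementation; the function name, the delay, and the meaning given to the retry count are assumptions made for illustration only.

```python
import pwd
import time


def lookup_user_with_retry(name, retries=3, delay=0.1, lookup=pwd.getpwnam):
    """Look up a user, retrying when the lookup fails transiently.

    Hypothetical helper: `retries` is the number of extra attempts made
    after the first failure. `lookup` is injectable so the behaviour can
    be exercised without a real LDAP backend.
    """
    attempts = retries + 1
    for attempt in range(attempts):
        try:
            return lookup(name)
        except KeyError:
            # pwd.getpwnam raises KeyError when no entry is found; a briefly
            # unavailable directory service may surface the same way.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

A higher retry count trades a slower rejection of genuinely unknown users for fewer jobs lost to transient lookup failures, which is the trade-off described above.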

David 


On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier < pregier at ittc.ku.edu > wrote: 


Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying 4.0.3 snapshot now), and it should also be noted that as part of the stress test I am constantly watching repeated qstats. The problem does not seem to appear with 4.0.x as such; might this be related to the switch from a single-threaded server to multi-threaded? 



----- Original Message ----- 
From: "Phil Regier" < pregier at ittc.ku.edu > 
To: torqueusers at supercluster.org 
Sent: Friday, June 22, 2012 2:14:12 PM 
Subject: Sporadic UID errors 

Sorry if this has been raised before (there is another active LDAP thread, but I think the problem there is very different); I'm still going through the archives. 

I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible upgrade from 2.x and have come across some odd behaviors. In particular, when I submit 1000 small jobs to a fake one-node cluster running Torque 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can retrieve specfiles etc. if that would help) and authenticated against LDAP, I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never get accepted); for example: 

... 
14289.localhost 
14290.localhost 
14291.localhost 
qsub: Bad UID for job execution MSG=User pregier does not exist in server password file 

14293.localhost 
14294.localhost 
14295.localhost 
... 


This is just a loop; jobs 14291 and 14293 are identical to what should have been 14292. 
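
The loop itself was not posted. As a rough sketch of what such a submission loop might look like, the following Python submits the same job repeatedly and records the rejected submissions. It assumes `qsub` is on the PATH; the function names and the script name `job.sh` are hypothetical.

```python
import subprocess


def submit_with_qsub(script_path):
    """Submit one job with qsub; return (ok, combined output text)."""
    result = subprocess.run(["qsub", script_path], capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, output


def stress_submit(count, script_path, submit=submit_with_qsub):
    """Submit `count` identical jobs and collect (index, message) for failures."""
    failures = []
    for i in range(count):
        ok, text = submit(script_path)
        if not ok:
            failures.append((i, text))
    return failures


if __name__ == "__main__":
    # Hypothetical usage: 1000 submissions of a small job script.
    failed = stress_submit(1000, "job.sh")
    print("%d of 1000 submissions failed" % len(failed))
    for index, message in failed:
        print(index, message)
```

Keeping the failure messages alongside their loop index makes it easy to confirm that rejected submissions are otherwise identical to their neighbours.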

Is this normal? Are there precautions to avoid it, or is this a bug I should be reporting in more detail? 

Thanks for any suggestions; I'm not terribly experienced with Torque, so I'm not sure how quickly I should be bringing this sort of thing to the list. I can provide more details about my setup and/or stress tests, but didn't want to dump too much useless information in my first post. 

Phil Regier 
Student assistant system administrator 
University of Kansas, ITTC 
_______________________________________________ 
torqueusers mailing list 
torqueusers at supercluster.org 
http://www.supercluster.org/mailman/listinfo/torqueusers 



-- 

David Beer | Software Engineer 
Adaptive Computing 


