[torqueusers] Sporadic UID errors

Phil Regier pregier at ittc.ku.edu
Mon Jun 25 09:54:03 MDT 2012


I'm switching back and forth in an attempt to ascertain which upgrade would be less painful.  There is a much uglier issue with 4.0.x that rears its head when I least expect it; I can't quite reproduce it consistently yet, but I'm seeing sporadic configuration database corruption (possibly related to Brian Andrus' "+=" bug from a couple weeks ago?) under heavy loads.  If I can reproduce it I'll report the details here (unless I can find it in the list archives).

Knowing that there is an easy solution to the only relatively major issue I've seen so far with 3.0.x is a big plus.

Thanks again!

PR

----- Original Message -----
From: "David Beer" <dbeer at adaptivecomputing.com>
To: "Torque Users Mailing List" <torqueusers at supercluster.org>
Sent: Monday, June 25, 2012 10:38:47 AM
Subject: Re: [torqueusers] Sporadic UID errors


You don't need to switch - this fix is in 4.* as well. 


David 


On Mon, Jun 25, 2012 at 9:36 AM, Phil Regier < pregier at ittc.ku.edu > wrote: 


Nice; that's pretty slick! I'm sure that will solve the problem; I'll switch back to 3.0.5 in a bit to try it out. 

Thanks! 

Phil 



----- Original Message ----- 
From: "David Beer" < dbeer at adaptivecomputing.com > 
To: "Torque Users Mailing List" < torqueusers at supercluster.org > 
Sent: Monday, June 25, 2012 10:01:29 AM 
Subject: Re: [torqueusers] Sporadic UID errors 


Phil, 

We have had other customers/users who hit this problem because LDAP lookups sometimes fail. We added a retry parameter for the moms. You can set it in the mom's config file by adding the line: 

$ext_pwd_retry <num retries> 

If you don't really have users going to machines that they shouldn't go to, then you might want to set this to a fairly high number so that jobs aren't lost unnecessarily. 
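To illustrate the idea behind a retry parameter like this, here is a minimal Python sketch of retrying a user lookup that can fail transiently (for example, when the LDAP backend briefly does not answer). This is not Torque's implementation; the function name `lookup_user_with_retry` and its `retries`, `delay`, and `lookup` parameters are hypothetical and chosen only for illustration.

```python
import pwd
import time


def lookup_user_with_retry(name, retries=3, delay=0.5, lookup=pwd.getpwnam):
    """Look up a user, retrying when the lookup reports "no such user".

    pwd.getpwnam raises KeyError when no entry is found. With a flaky
    directory backend, a user that does exist can look missing for a
    moment, so the lookup is repeated before giving up.
    `retries` is the number of extra attempts after the first one
    (hypothetical parameter, loosely mirroring a retry count).
    """
    attempts = max(retries, 0) + 1  # first try plus the configured retries
    for attempt in range(attempts):
        try:
            return lookup(name)
        except KeyError:
            if attempt == attempts - 1:
                raise  # every attempt failed: treat the user as missing
            time.sleep(delay)  # short pause before asking again
```

The trade-off matches the advice above: a higher retry count makes it less likely that a job is rejected because of a momentary lookup failure, at the cost of a longer wait before a genuinely unknown user is refused.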

David 


On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier < pregier at ittc.ku.edu > wrote: 


Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying 4.0.3 snapshot now), and it should also be noted that as part of the stress test I am constantly watching repeated qstats. The problem does not seem to appear with 4.0.x as such; might this be related to the switch from a single-threaded server to multi-threaded? 



----- Original Message ----- 
From: "Phil Regier" < pregier at ittc.ku.edu > 
To: torqueusers at supercluster.org 
Sent: Friday, June 22, 2012 2:14:12 PM 
Subject: Sporadic UID errors 

Sorry if this has been raised before; I'm still going through the archives. (There is another LDAP thread active, but I think the problem there is very different.) 

I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible upgrade from 2.x and have come across some odd behaviors. In particular, when I submit 1000 small jobs to a fake one-node cluster running Torque 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can retrieve specfiles etc. if that would help) and authenticated against LDAP, I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never get accepted); for example: 

... 
14289.localhost 
14290.localhost 
14291.localhost 
qsub: Bad UID for job execution MSG=User pregier does not exist in server password file 

14293.localhost 
14294.localhost 
14295.localhost 
... 


This is just a loop; there is no difference between job 14291, 14293, and what should have been 14292. 
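For readers who want to try a similar test, a submission loop of this kind might look like the Python sketch below. It is not the script used in this thread; it assumes `qsub` is on the PATH, reads the job script from standard input, and exits non-zero when a submission is rejected. The helper names `submit_job` and `stress_submit` are hypothetical.

```python
import subprocess


def submit_job(script="echo hello\n"):
    """Submit one trivial job with qsub; return (ok, output).

    Assumes qsub reads the job script from stdin and returns a
    non-zero exit status when the submission is rejected.
    """
    result = subprocess.run(
        ["qsub"], input=script, capture_output=True, text=True
    )
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, output


def stress_submit(count, submit=submit_job):
    """Submit `count` identical jobs; return the failures.

    Each failure is recorded as (loop index, message) so that
    sporadic rejections can be counted and inspected afterwards.
    """
    failures = []
    for i in range(count):
        ok, output = submit()
        if not ok:
            failures.append((i, output))
    return failures
```

On a submit host, `stress_submit(1000)` would return the list of rejected submissions; with the failure rate described above, that list would be expected to hold two or three entries carrying the "Bad UID" message.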

Is this normal? Are there precautions to avoid it, or is this a bug I should be reporting in more detail? 

Thanks for any suggestions; I'm not terribly experienced with Torque, so I'm not sure how quickly I should be bringing this sort of thing to the list. I can provide more details about my setup and/or stress tests, but didn't want to dump too much useless information in my first post. 

Phil Regier 
Student assistant system administrator 
University of Kansas, ITTC 
_______________________________________________ 
torqueusers mailing list 
torqueusers at supercluster.org 
http://www.supercluster.org/mailman/listinfo/torqueusers 



-- 

David Beer | Software Engineer 
Adaptive Computing 






