[torqueusers] Sporadic UID errors
Phil Regier
pregier at ittc.ku.edu
Mon Jun 25 09:54:03 MDT 2012
I'm switching back and forth in an attempt to ascertain which upgrade would be less painful. There is a much uglier issue with 4.0.x that rears its head when I least expect it; I can't quite reproduce it consistently yet, but I'm seeing sporadic configuration database corruption (possibly related to Brian Andrus' "+=" bug from a couple weeks ago?) under heavy loads. If I can reproduce it I'll report the details here (unless I can find it in the list archives).
Knowing that there is an easy solution to the only relatively major issue I've seen so far with 3.0.x is a big plus.
Thanks again!
PR
----- Original Message -----
From: "David Beer" <dbeer at adaptivecomputing.com>
To: "Torque Users Mailing List" <torqueusers at supercluster.org>
Sent: Monday, June 25, 2012 10:38:47 AM
Subject: Re: [torqueusers] Sporadic UID errors
You don't need to switch - this fix is in 4.* as well.
David
On Mon, Jun 25, 2012 at 9:36 AM, Phil Regier < pregier at ittc.ku.edu > wrote:
Nice; that's pretty slick! I'm sure that will solve the problem; I'll switch back to 3.0.5 in a bit to try it out.
Thanks!
Phil
----- Original Message -----
From: "David Beer" < dbeer at adaptivecomputing.com >
To: "Torque Users Mailing List" < torqueusers at supercluster.org >
Sent: Monday, June 25, 2012 10:01:29 AM
Subject: Re: [torqueusers] Sporadic UID errors
Phil,
We have had other customers/users that had this problem due to LDAP failing sometimes. We added a retry parameter for the moms. You can set it in the mom's config file, just add the line:
$ext_pwd_retry <num retries>
If you don't really have users going to machines that they shouldn't go to, then you might want to set this to a fairly high number so that jobs aren't lost unnecessarily.
David
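To make the retry idea concrete, here is a minimal Python sketch of what retrying a password lookup means. It is not TORQUE's implementation. The function name `lookup_user_with_retry`, the `delay` parameter, and the reading of the retry count as "extra attempts after the first" are all assumptions made for illustration. In the mom's config file, the setting itself would just be a line such as `$ext_pwd_retry 10`, where the value 10 is only an example.

```python
import pwd
import time


def lookup_user_with_retry(name, retries=3, delay=0.1, lookup=pwd.getpwnam):
    """Look up a user, retrying when the lookup fails.

    Hypothetical helper: `retries` is the number of extra attempts made
    after the first one, and `delay` is the pause between attempts in
    seconds. `lookup` defaults to pwd.getpwnam, which raises KeyError
    when no entry is found for the name.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return lookup(name)
        except KeyError:
            # Give up only after the last attempt; otherwise wait and retry.
            if attempt == attempts:
                raise
            time.sleep(delay)
```

A larger retry count trades a longer wait for a user who really does not exist against fewer jobs lost to a transient LDAP failure, which is the trade-off described above.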
On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier < pregier at ittc.ku.edu > wrote:
Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying 4.0.3 snapshot now), and it should also be noted that as part of the stress test I am constantly watching repeated qstats. The problem does not seem to appear with 4.0.x as such; might this be related to the switch from a single-threaded server to multi-threaded?
----- Original Message -----
From: "Phil Regier" < pregier at ittc.ku.edu >
To: torqueusers at supercluster.org
Sent: Friday, June 22, 2012 2:14:12 PM
Subject: Sporadic UID errors
Sorry if this has been raised before; I'm still going through the archives. (There is another LDAP thread active, but I think the problem there is very different.)
I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible upgrade from 2.x and have come across some odd behaviors. In particular, when I submit 1000 small jobs to a fake one-node cluster running Torque 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can retrieve specfiles etc. if that would help) and authenticated against LDAP, I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never get accepted); for example:
...
14289.localhost
14290.localhost
14291.localhost
qsub: Bad UID for job execution MSG=User pregier does not exist in server password file
14293.localhost
14294.localhost
14295.localhost
...
This is just a loop; there is no difference between job 14291, 14293, and what should have been 14292.
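A submission loop of this kind could be sketched in Python as follows. This is an illustration, not the script used for the test above. The helper names, the `sleep 1` job script, and the assumption that `qsub` exits non-zero when it rejects a job are all hypothetical. The sketch relies on `qsub` reading the job script from standard input when no script file is given.

```python
import subprocess


def submit_with_qsub(script):
    """Pipe a job script to qsub on stdin and return (ok, text).

    Assumes qsub returns a non-zero exit status when a submission is
    rejected; `text` is the job ID on success or the error message.
    """
    result = subprocess.run(
        ["qsub"], input=script, capture_output=True, text=True
    )
    if result.returncode == 0:
        return True, result.stdout.strip()
    return False, result.stderr.strip()


def stress_submit(count, script="sleep 1\n", submit=submit_with_qsub):
    """Submit `count` identical jobs and return the failed submissions.

    Each failure is recorded as (loop index, error text), so the gaps in
    the job ID sequence can be matched to the error that caused them.
    """
    failures = []
    for index in range(count):
        ok, text = submit(script)
        if not ok:
            failures.append((index, text))
    return failures
```

Calling `stress_submit(1000)` on a submit host would return one entry per rejected job, which makes it easy to count the failures and to confirm that the failing submissions are identical to their neighbours.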
Is this normal? Are there precautions to avoid it, or is this a bug I should be reporting in more detail?
Thanks for any suggestions; I'm not terribly experienced with Torque, so I'm not sure how quickly I should be bringing this sort of thing to the list. I can provide more details about my setup and/or stress tests, but didn't want to dump too much useless information in my first post.
Phil Regier
Student assistant system administrator
University of Kansas, ITTC
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
--
David Beer | Software Engineer
Adaptive Computing