[torqueusers] Sporadic UID errors
pregier at ittc.ku.edu
Fri Jun 22 14:48:19 MDT 2012
Oops. An error and an omission: I meant 4.0.2 rather than 4.0.4 (I am trying a 4.0.3 snapshot now). I should also have noted that, as part of the stress test, I am constantly watching the output of repeated qstats. The problem does not seem to appear with 4.0.x as such; might this be related to the switch from a single-threaded server to a multi-threaded one?
----- Original Message -----
From: "Phil Regier" <pregier at ittc.ku.edu>
To: torqueusers at supercluster.org
Sent: Friday, June 22, 2012 2:14:12 PM
Subject: Sporadic UID errors
Sorry if this has been raised before; I'm still going through the archives. (There is another LDAP thread active, but I think the problem there is very different.)
I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible upgrade from 2.x and have come across some odd behaviors. My test setup is a fake one-node cluster running Torque 3.0.5 and Maui 3.3.1, authenticated against LDAP. Both were built in-house as RPMs (not by me, but I can retrieve specfiles etc. if that would help). When I submit 1000 small jobs to it, I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never get accepted); for example:
qsub: Bad UID for job execution MSG=User pregier does not exist in server password file
The jobs are submitted from a simple loop; there is no difference between jobs 14291 and 14293 and what should have been job 14292.
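For anyone who wants to reproduce this, the kind of loop described above could be sketched roughly as follows. This is not the actual test script: the function names submit_jobs and qsub_echo are hypothetical, and the sketch assumes that qsub is on the PATH and reads the job script from standard input when no script file is given.

```python
import subprocess


def submit_jobs(count, submit_one):
    """Call submit_one() `count` times and collect the failures.

    submit_one is any callable returning (ok, message). Returns a list of
    (attempt_index, message) tuples, one per rejected submission.
    """
    failures = []
    for i in range(count):
        ok, message = submit_one()
        if not ok:
            failures.append((i, message))
    return failures


def qsub_echo():
    """Submit one trivial job by piping a one-line script to qsub.

    Assumption: qsub accepts the job script on stdin. Returns (ok, message),
    where message is whatever qsub printed (job id on success, error text
    on failure).
    """
    result = subprocess.run(
        ["qsub"],
        input="echo hello\n",
        capture_output=True,
        text=True,
    )
    message = (result.stderr or result.stdout).strip()
    return result.returncode == 0, message
```

Calling submit_jobs(1000, qsub_echo) on a submit host and printing the returned list would show which attempts were rejected and with what message, which makes it easier to compare failure rates between Torque versions.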
Is this normal? Are there precautions to avoid it, or is this a bug I should be reporting in more detail?
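One way to narrow this down might be to check whether plain user lookups on the server host fail at a similar rate, independent of Torque. The sketch below assumes (this is an assumption, not something confirmed here) that the server's check ultimately goes through the standard password database, which on an LDAP-backed host would be served via NSS. The name probe_lookups is hypothetical.

```python
import pwd
import time


def probe_lookups(user, attempts, lookup=pwd.getpwnam, delay=0.0):
    """Look up `user` repeatedly; return the attempt indices that failed.

    pwd.getpwnam raises KeyError when the user cannot be found, so a
    non-empty result for a user who certainly exists points at flaky
    name-service lookups rather than at the batch system.
    """
    misses = []
    for i in range(attempts):
        try:
            lookup(user)
        except KeyError:
            misses.append(i)
        if delay:
            time.sleep(delay)  # optional pause between lookups
    return misses
```

Running probe_lookups("pregier", 1000) on the server while the stress test is in progress would show whether the lookups themselves are dropping out sporadically.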
Thanks for any suggestions; I'm not terribly experienced with Torque, so I'm not sure how quickly I should be bringing this sort of thing to the list. I can provide more details about my setup and/or stress tests, but didn't want to dump too much useless information in my first post.
Student assistant system administrator
University of Kansas, ITTC