[torqueusers] process using more CPUs than requested
knielson at adaptivecomputing.com
Thu Mar 3 11:49:38 MST 2011
On 03/02/2011 05:40 PM, Michael Jennings wrote:
> On Tuesday, 01 March 2011, at 09:24:35 (-0700),
> Ken Nielson wrote:
>> We need to talk about what the policies would be about its use. For
>> example, it is easy if you know when each job starts it will have
>> exclusive access to the machine. We would do a search and destroy on
>> all user processes before we start the job. But it becomes more
>> difficult when a node is shared. It becomes even more difficult when
>> the same user has multiple jobs running on the same node.
>> Please give us an idea of the use cases for which this feature would
>> be needed.
> Personally, I think the fundamental flaw in traditional approaches to
> this issue (prologue/epilogue/health check scripts) is that they're
> approaching the problem from a position of ignorance and must gather
> or assume all the information they require to perform their action
> (killing rogue jobs). This has a higher probability for error than
> approaching from a standpoint of fact, which only torque can provide
> ("who's *supposed* to be running on this node at this moment?").
The script Troy references (reaver) determines who should be running by
looking in the mom_priv/jobs directory and extracting user information
from the .JB files. This probably tells the script everything pbs_mom
knows about who should be running. There may be something else the MOM
knows that is not available in the jobs directory; I will need to review
the code to be sure.
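
As a rough illustration of the directory lookup described above, here is a
minimal Python sketch that lists which jobs currently have a .JB file on a
node. It is not reaver's code. The default path and the idea that the file
name minus ".JB" identifies the job are assumptions, and the function name
active_job_ids is made up for this example. It deliberately does not parse
the .JB contents, since that layout is internal to pbs_mom.

```python
from pathlib import Path

# Assumed location of the MOM's job directory; adjust for your install.
DEFAULT_JOBS_DIR = "/var/spool/torque/mom_priv/jobs"


def active_job_ids(jobs_dir=DEFAULT_JOBS_DIR):
    """Return the sorted names (minus the .JB suffix) of job files in jobs_dir.

    Assumption: each running job has one <something>.JB file, and
    <something> is usable as a job identifier.
    """
    jobs_path = Path(jobs_dir)
    if not jobs_path.is_dir():
        return []
    return sorted(
        entry.stem
        for entry in jobs_path.iterdir()
        if entry.is_file() and entry.suffix == ".JB"
    )


if __name__ == "__main__":
    for job_id in active_job_ids():
        print(job_id)
```

Mapping each job to its owner would still require reading the file contents
or asking the MOM, which is exactly the part that needs a code review.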
> The PAM module is a valid prevention-based approach, which is
> definitely the "ideal" way to do it, but PAM configurations can be
> challenging to read and even harder to write accurately. It also
> doesn't account for the "formerly allowed, now denied" case (e.g., a
> user whose job has finished but which has managed to fork off some
> kiddies which are now running rampant).
Yes. PAM is good at stopping unauthorized users from getting started,
but it does not stop authorized users when their time is up. Some sites
let users reserve nodes through TORQUE and then run their jobs using
some other mechanism. The ability to police this kind of user would be
very useful.
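
To make the policing idea concrete, here is a small Python sketch of the
selection step only: given the set of UIDs that should be running on the
node, report processes owned by anyone else. It assumes a Linux-style /proc
where each numeric directory is owned by the process owner. The names
process_owners and stray_pids and the min_user_uid cutoff of 1000 (used to
skip system accounts) are hypothetical choices for this example. It only
reports PIDs; it does not kill anything.

```python
import os


def process_owners(proc_root="/proc"):
    """Map PID -> owning UID by stat-ing each numeric directory in proc_root."""
    owners = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            owners[int(name)] = os.stat(os.path.join(proc_root, name)).st_uid
        except OSError:
            # The process exited between listdir() and stat(); skip it.
            continue
    return owners


def stray_pids(owners, allowed_uids, min_user_uid=1000):
    """Return sorted PIDs owned by ordinary users who are not in allowed_uids.

    min_user_uid is an assumed cutoff below which UIDs are treated as
    system accounts and never reported.
    """
    return sorted(
        pid
        for pid, uid in owners.items()
        if uid >= min_user_uid and uid not in allowed_uids
    )


if __name__ == "__main__":
    # allowed_uids would come from whatever the MOM knows; empty here.
    print(stray_pids(process_owners(), allowed_uids=set()))
```

A check this coarse only fits nodes where a user either has a job or does
not. It cannot tell apart two jobs from the same user on a shared node.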
> I definitely think there are scenarios where killing processes in
> hindsight would be useful, and the best way to approach that problem
> is to be armed with sufficient, accurate information. That means
> knowing what torque knows.
Unless the MOM knows something about running jobs that is not already
present in the .JB files, a script may already have the accurate
information it needs.
Just a side note: how about calling this cleanup method a sweeper?