[torqueusers] process using more CPUs than requested
mej at lbl.gov
Wed Mar 2 17:40:14 MST 2011
On Tuesday, 01 March 2011, at 09:24:35 (-0700),
Ken Nielson wrote:
> We need to talk about what the policies would be about its use. For
> example, it is easy if you know when each job starts it will have
> exclusive access to the machine. We would do a search and destroy on
> all user processes before we start the job. But it becomes more
> difficult when a node is shared. It becomes even more difficult when
> the same user has multiple jobs running on the same node.
> Please give us an idea of the use cases for which this feature would
> be needed.
Personally, I think the fundamental flaw in the traditional approaches to
this issue (prologue/epilogue/health-check scripts) is that they approach
the problem from a position of ignorance: they must gather or assume all
the information they need in order to act (i.e., kill rogue processes).
That carries a higher probability of error than starting from a standpoint
of fact, which only torque can provide ("who is *supposed* to be running
on this node at this moment?").
The PAM module is a valid prevention-based approach, and definitely the
"ideal" way to do it, but PAM configurations can be challenging to read
and even harder to write accurately. It also doesn't cover the "formerly
allowed, now denied" case (e.g., a user whose job has finished but has
managed to fork off some kiddies which are now running rampant).
I definitely think there are scenarios where killing processes in
hindsight would be useful, and the best way to approach that problem
is to be armed with sufficient, accurate information. That means
knowing what torque knows.
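To illustrate the "kill in hindsight, armed with torque's facts" idea, here is a minimal sketch. The function names and the way allowed_users is obtained are assumptions for illustration; in practice the allowed-user set would come from torque itself (e.g., the jobs pbs_mom reports for this node), which is exactly the information only torque can supply:

```python
import os
import pwd

def find_rogue_pids(pid_owners, allowed_users, system_uid_max=999):
    """Given {pid: (uid, username)}, return pids owned by users who
    have no active job on this node.  System accounts
    (uid <= system_uid_max) are never flagged."""
    rogues = []
    for pid, (uid, user) in pid_owners.items():
        if uid <= system_uid_max:
            continue  # leave root and daemons alone
        if user not in allowed_users:
            rogues.append(pid)
    return sorted(rogues)

def scan_proc():
    """Collect {pid: (uid, username)} from /proc (Linux only)."""
    owners = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            uid = os.stat("/proc/" + entry).st_uid
            owners[int(entry)] = (uid, pwd.getpwuid(uid).pw_name)
        except (OSError, KeyError):
            pass  # process exited, or uid has no passwd entry; skip
    return owners
```

A real epilogue or health-check would call scan_proc(), build allowed_users from torque's job list for this host, and then signal the returned pids (SIGTERM, then SIGKILL); the point is that the "allowed" set comes from torque's facts, not from guesswork.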
I'm not sure I answered your question, so let me know if I didn't! :)
Michael Jennings <mej at lbl.gov>
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E W: 510-495-2687
MS 050C-3396 F: 510-486-8615