[torqueusers] User's job can mess up the system so that no jobs
glen.beane at gmail.com
Thu Sep 6 11:05:54 MDT 2007
On 9/6/07, Atwood, Robert C <r.atwood at imperial.ac.uk> wrote:
> I suppose this does not happen that often, since it's the first time in
> several years of using openPBS and then Torque that it has happened on
> my system ...
> One user submitted a malformed job of some kind that kept echoing a
> string to stdout. Eventually it filled up the disk partition containing
> /var/spool/torque . This happened on node01 (the first node in the list
> of available nodes)
> Subsequently, all users' jobs failed to run or return any stdout or
> stderr files, thus making it difficult to tell what the problem actually
> was. That's because the jobs were always getting directed to node01 as
> it was marked 'free'.
> Is there a good way within torque to prevent this behaviour? Apart from
> banning certain users that is!
We have some mechanisms in TORQUE that allow pbs_mom to mark itself
offline if its spool directory becomes full, but I do not think this
is enabled by default.
In any case, I think the best approach to this problem would be to
write a node health check script that marks a node as offline if
it finds that this directory has filled up.
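
A minimal sketch of such a check, in Python, is below. It is only an
illustration: the 5% threshold and the function names are made up for
this example, and the spool path is the one mentioned in the original
report. It assumes the script is hooked into pbs_mom's health check
mechanism and that a line of output starting with "ERROR" is how the
problem gets reported; check the documentation for your TORQUE version
before relying on that.

```python
#!/usr/bin/env python3
"""Sketch of a node health check that flags a nearly full spool partition."""
import shutil

# Path taken from the original report; adjust for your installation.
SPOOL_DIR = "/var/spool/torque"
# Hypothetical threshold: complain when less than 5% of the partition is free.
MIN_FREE_FRACTION = 0.05


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report a size of zero; treat as full.
        return 0.0
    return usage.free / usage.total


def check_spool(path, min_free):
    """Return an 'ERROR ...' message if `path` is low on space, else None."""
    try:
        fraction = free_fraction(path)
    except OSError as exc:
        return f"ERROR cannot check {path}: {exc}"
    if fraction < min_free:
        return f"ERROR {path} has only {fraction:.1%} free space"
    return None


if __name__ == "__main__":
    message = check_spool(SPOOL_DIR, MIN_FREE_FRACTION)
    if message is not None:
        # Assumption: the health check reports failures on stdout.
        print(message)
```

If the node should be taken offline directly rather than just reporting
the error, the script could instead run "pbsnodes -o" with the node's
name, provided the node is allowed to issue that command to the server.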