[torqueusers] usage of a cluster
chris at csamuel.org
Mon Feb 1 17:17:17 MST 2010
Can I suggest that many of these questions would be better
asked on the Beowulf mailing list, which is all about
Linux clusters in general: http://www.beowulf.org/
This list is mainly for questions about the Torque queuing
system. That said...
> 1. I noticed that there is no swap for any node of
> our cluster. Is it normal for most clusters?
Not really. People often argue that running without swap
is good, but it does mean the kernel has no way to page
anonymous memory out under memory pressure. All it can do is
evict file-backed pages, writing the dirty ones back to their
files, which is (apparently) slower than paging to swap.
That's especially important for temporary files, which
could just get unlinked before their pages ever need to be evicted.
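If you want to check for yourself whether a node has swap configured, the SwapTotal line in /proc/meminfo will tell you. Here is a minimal Python sketch, assuming a Linux node; the helper name and the sample text are made up for illustration and are not output from a real node.

```python
def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like "SwapTotal:       2097148 kB".
            return int(line.split()[1])
    return None


# Illustrative sample only (hypothetical values).
sample = (
    "MemTotal:       16384000 kB\n"
    "SwapTotal:             0 kB\n"
    "SwapFree:              0 kB\n"
)
print(swap_total_kb(sample))
```

On a real node you would pass in the contents of /proc/meminfo, e.g. swap_total_kb(open("/proc/meminfo").read()). A result of 0 means no swap is configured.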
> I am running my job on a node. What will happen to my job if
> the memory is used up? Do I have no other choice but to kill
> my job?
Usually the kernel's out-of-memory (OOM) killer will kill that process for you.
> The node finally runs out of memory and does not respond.
> I emailed it to the administrator and after a while I found
> the node is rebooted without affecting other nodes. Feel lucky
> my job did not bring down the whole cluster.
It shouldn't bring down the cluster, but if you were sharing the
node with other users' jobs it would have killed them!
> Is using up memory the kind of behaviour that administrators
> dislike from users?
Very much so!
> 2. My jobs are submitted through Torque. Will Torque make the
> newly submitted jobs wait if there are not enough
> resources to run them?
They will normally wait, though it isn't Torque's call - the
scheduler decides what to run, and sites usually do not set up
policies that overcommit the resources.
> I wonder if each user still has to check the usage status
> of the cluster before deciding to submit new jobs? How to
No, the whole point of a queuing system is to manage a
situation where demand outstrips supply and so it has
to make the decisions on what to do, not you.
> By "qstat" I can see the jobs that are running and
> by "qstat -q" I can see how many jobs in each queue
> are running.
Correct - and if your site uses Maui or Moab as the
scheduler you can use "showq" to get even more info.
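If you want a quick count of running versus queued jobs, you can tally the state column of plain "qstat" output. This is only a rough Python sketch: the function name is hypothetical, and the sample below is an assumed illustration of the default column layout (job id, name, user, time used, state, queue), so check it against what your site's qstat actually prints.

```python
from collections import Counter


def count_job_states(qstat_text):
    """Count jobs per state letter (R, Q, ...) in default `qstat`-style text."""
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip blank lines, the "Job id ..." header and the dashed separator.
        if len(fields) < 6 or fields[0] == "Job" or fields[0].startswith("-"):
            continue
        counts[fields[-2]] += 1  # the state is the column just before the queue
    return counts


# Assumed sample layout with made-up jobs, for illustration only.
qstat_sample = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
101.head                  sim_a            alice           01:02:03 R batch
102.head                  sim_b            bob             00:00:00 Q batch
103.head                  sim_c            alice           00:10:00 R batch
"""
print(dict(count_job_states(qstat_sample)))
```

To run it against the live system you could feed it subprocess.run(["qstat"], capture_output=True, text=True).stdout instead of the sample.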
> But how can I find info about the usage percentage
> of all nodes and cores and memory to get a big picture
> and decide if I had better not submit my jobs but wait
> for more resources to become available?
Firstly that's a site-specific query - the tools sites
use for monitoring nodes vary enormously.
Secondly the whole point of a queuing system is so you
don't have to worry about that - you submit it into the
queue and at some point (hopefully) it will run when the
resources are available.
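That said, if you are just curious, Torque's own "pbsnodes -a" lists each node with its state and core count (np). Below is a rough Python sketch that summarises such a listing. The function name is hypothetical and the sample is an assumed illustration of the "key = value" layout, not output from a real cluster; treat the result as a rough picture only, since the scheduler still decides what runs.

```python
from collections import Counter


def summarize_nodes(pbsnodes_text):
    """Return (nodes per state, total np) from `pbsnodes -a`-style text."""
    states = Counter()
    total_np = 0
    for line in pbsnodes_text.splitlines():
        key, sep, value = line.strip().partition(" = ")
        if not sep:
            continue  # node-name lines and blank lines carry no "key = value"
        if key == "state":
            states[value] += 1
        elif key == "np":
            total_np += int(value)
    return states, total_np


# Assumed sample layout with made-up nodes, for illustration only.
pbsnodes_sample = """\
node01
     state = free
     np = 8
     ntype = cluster

node02
     state = job-exclusive
     np = 8
     ntype = cluster
     jobs = 0/101.head, 1/101.head
"""
node_states, cores = summarize_nodes(pbsnodes_sample)
print(dict(node_states), cores)
```

A node can report a compound state such as "down,offline"; this sketch simply counts that whole string as its own state.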
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC