[torqueusers] qtop: yet another tool to struggle with torque and PBS family systems
Fotis Georgatos
fotis at cern.ch
Fri Sep 3 04:27:52 MDT 2010
Hi Arnau,
thanks for your email.
On 03/09/2010 10:55, Arnau Bria wrote:
> Could you please give a brief explanation on how to interpret the array?
> *If you want, you could explain me the example on your web, so I don't
> send you my output and spam other users :-)
I think you need to read the NodeID vertically: each ID spans 3 lines.
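To make the vertical reading concrete, here is a minimal sketch in Python.
The three header rows are made up for illustration (they are not copied
from real qtop output), and read_node_ids is a hypothetical helper, not
part of qtop.

```python
# Hypothetical three-row NodeID header: each column is one node, and its
# ID is spelled top to bottom (hundreds, tens, units).
header_rows = [
    "000000",
    "222223",
    "456780",
]

def read_node_ids(rows):
    """Join the characters of each column into one integer node ID."""
    return [int("".join(column)) for column in zip(*rows)]

node_ids = read_node_ids(header_rows)
print(node_ids)  # [24, 25, 26, 27, 28, 30]
```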
qtop does indeed have an issue with complex cluster naming schemes (e.g.
Rocks-made clusters) and tries to "remap" them to fictional names
wn101-wn999, in the pbsnodes -a order.
I have found no other generic way to handle this, but if you know a
better one, please comment.
Perhaps I'm trying too hard to conserve space and should show the real
node name anyway.
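The remapping idea can be sketched roughly as below. This is only an
illustration under assumptions, not qtop's actual code: the helper
remap_node_names, the sample node names, and the choice to start
numbering at 101 (guessed from the wn101-wn999 range) are all mine.

```python
def remap_node_names(real_names, first_id=101):
    """Map real node names to short fictional wnNNN names, keeping order.

    Assumption: numbering starts at 101 and follows the input order,
    which stands in for the order of `pbsnodes -a`.
    """
    return {name: f"wn{first_id + i}" for i, name in enumerate(real_names)}

# Hypothetical Rocks-style names, in the order pbsnodes -a would list them.
real_names = ["compute-0-0.local", "compute-0-1.local", "compute-1-0.local"]
mapping = remap_node_names(real_names)

for real, short in mapping.items():
    print(f"{short}  <-  {real}")
```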
In short, the matrix dimensions should correspond to the product:
(number of nodes) x (number of cores per node)
Some sites have a heterogeneous setup, with a different number of cores
per node; in that case, try this command instead: "DEBUG=yes ./qtop"
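As a quick sanity check of that product, a small sketch: matrix_cells is a
hypothetical helper and the core counts are invented. It returns the
expected number of cells for a homogeneous cluster and None when the core
counts differ, which is the case where the simple product no longer holds.

```python
def matrix_cells(cores_per_node):
    """Return nodes x cores for a homogeneous cluster, else None."""
    distinct = set(cores_per_node.values())
    if len(distinct) != 1:
        return None  # heterogeneous (or empty): the product does not apply
    return len(cores_per_node) * distinct.pop()

# Hypothetical core counts, keyed by node name.
homogeneous = {"wn101": 4, "wn102": 4, "wn103": 4}
heterogeneous = {"wn101": 4, "wn102": 8}

print(matrix_cells(homogeneous))    # 12
print(matrix_cells(heterogeneous))  # None
```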
Now, inside the matrix you see characters, each of which corresponds
uniquely to a user account.
The symbol "0" (zero) always corresponds to the most active user (by
number of entries in the qstat output), and so on for the rest of the
users; the series of symbols used is:
[0-9],[A-Z],[a-z] etc.
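The symbol assignment can be sketched as follows. This is an illustration,
not qtop's code: assign_symbols is a hypothetical helper, the job owner
list is invented (apart from the account names atlasprd and honeprd), ties
are broken alphabetically as an assumption, and the sketch stops at the 62
symbols listed above because what follows the "etc." is not specified here.

```python
from collections import Counter
import string

# The symbol series from the text: [0-9], [A-Z], [a-z].
SYMBOLS = string.digits + string.ascii_uppercase + string.ascii_lowercase

def assign_symbols(job_owners):
    """Rank users by number of jobs and give the busiest one symbol '0'."""
    counts = Counter(job_owners)
    # Assumption: ties are broken by user name to keep the result stable.
    ranked = sorted(counts, key=lambda user: (-counts[user], user))
    return dict(zip(ranked, SYMBOLS))  # zip stops after 62 users

# Hypothetical list of job owners, one entry per line of qstat output.
job_owners = ["atlasprd", "honeprd", "atlasprd", "user3", "atlasprd", "honeprd"]
symbols = assign_symbols(job_owners)
print(symbols)  # {'atlasprd': '0', 'honeprd': '1', 'user3': '2'}
```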
Now, let's read an example from here:
https://twiki.cscs.ch/twiki/bin/view/DECH/QTOP
Notice first the "Node state" line, which shows the first letter of each
node's state: j=job-exclusive, d=down etc.
Find ID 26 vertically; it is an "o"fflined node (notice the empty cores;
node 35 is also "o").
There you can see that atlasprd is "allocated" on CPU3, according to the
pbsnodes -a output.
That is actually an active torque bug which requires human intervention,
since that node has been rebooted and there is no chance that a job is
still running there.
Yes, qtop blindly passes on the information from the pbsnodes/qstat
commands.
Also, try to answer the opposite question yourself: where is user
honeprd's job running?
That output can be pretty handy when you need to quickly chase a user job
within torque, or when a node is misbehaving and you need to find which
users are related to or impacted by it.
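Chasing a user through the matrix amounts to the lookup sketched below.
The layout (one row per core, one column per node) is inferred from the
description above; the matrix, the node IDs and the helper find_user are
invented for illustration.

```python
def find_user(matrix, node_ids, symbol):
    """Return (node_id, core_index) pairs where the user's symbol appears."""
    return [
        (node_ids[col], core)
        for core, row in enumerate(matrix)
        for col, cell in enumerate(row)
        if cell == symbol
    ]

# Hypothetical data: 4 nodes, 2 cores each; a space means an idle core.
node_ids = [24, 25, 26, 27]
matrix = [
    "01 0",  # core 0 of each node
    "0  1",  # core 1 of each node
]

print(find_user(matrix, node_ids, "1"))  # [(25, 0), (27, 1)]
```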
Another way to use the tool is to run "watch -d qtop", where you can see
resource (core) allocation/deallocation happening dynamically. Fancy and
useful when playing with maui/moab.
Also, you get a very descriptive result when you have a black hole case.
I hope that was helpful.
By the way, if you want me to debug your system, let's take it off-list:
just send me the output of pbsnodes -a, qstat -q and qstat, as described
on the twiki.
Thank you for your questions, I will eventually make an FAQ from them.
cheers,
Fotis
--
echo "sysadmin know better bash than english" | sed s/min/mins/ \
| sed 's/better bash/bash better/' # Yelling in a CERN forum