[torqueusers] qtop: yet another tool to struggle with torque and PBS family systems

Fotis Georgatos fotis at cern.ch
Fri Sep 3 04:27:52 MDT 2010


Hi Arnau,

thanks for your email.

On 03/09/2010 10:55, Arnau Bria wrote:
 > Could you please give a brief explanation on how to interpret the array?
 > *If you want, you could explain me the example on your web, so I don't
 > send you my output and spam other users :-)

I think you need to try reading the NodeID vertically, spanning 3 lines.
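
To illustrate the vertical layout, here is a minimal Python sketch (not
qtop's actual code). The function name vertical_ids, the width parameter
and the space padding are all assumptions made for the example; it also
assumes no ID has more than `width` digits.

```python
def vertical_ids(node_ids, width=3):
    """Return `width` text rows; reading column i top to bottom spells node_ids[i].

    Hypothetical helper: IDs are right-aligned and padded with spaces,
    which is an assumption, not necessarily what qtop prints.
    """
    padded = [str(n).rjust(width) for n in node_ids]
    # Row r of the output collects the r-th character of every padded ID.
    return ["".join(p[row] for p in padded) for row in range(width)]


rows = vertical_ids([8, 26, 35, 101])
for row in rows:
    print(row)
```

Each column of the printed block is one node, so ID 26 is the column with
"2" in the middle row and "6" in the bottom row.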

qtop does indeed have an issue with complex cluster naming schemes (e.g.
Rocks-made clusters) and tries to "remap" the nodes to fictional names
wn101-wn999, in pbsnodes -a order.
I have found no other generic way to handle this, but if you know a better
one, please comment.
Perhaps I'm trying too hard to conserve space and should show the real node
name anyhow.
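
The remapping idea can be sketched roughly as follows. This is only an
illustration in Python, not the code qtop uses; remap_node_names, its
parameters and the sample Rocks-style host names are hypothetical, and the
input list is assumed to be already in pbsnodes -a order.

```python
def remap_node_names(real_names, start=101, limit=999):
    """Map real node names to fictional wnNNN names, keeping the input order.

    Hypothetical helper: `real_names` is assumed to be the node list in
    the order reported by `pbsnodes -a`.
    """
    if len(real_names) > limit - start + 1:
        raise ValueError("more nodes than available wn101-wn999 names")
    return {name: f"wn{start + i}" for i, name in enumerate(real_names)}


# Sample names are made up for the example.
names = ["compute-0-0.local", "compute-0-1.local", "compute-1-0.local"]
mapping = remap_node_names(names)
for real, fictional in mapping.items():
    print(real, "->", fictional)
```

The point of the short fictional names is that every node ID fits in three
digits, which keeps the matrix compact.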

In short, the matrix dimensions should correspond to the product:
(number of nodes)x(number of cores per node)
Some sites have a heterogeneous setup, with a different number of cores per
node; in that case I advise trying this command instead: "DEBUG=yes ./qtop"

Inside the matrix you see characters, each of which corresponds uniquely to
a user account.
The symbol "0" (zero) always corresponds to the most active user (by number
of entries in the qstat output), and so on for the remaining users; the
series of symbols used is: [0-9],[A-Z],[a-z] etc.
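
As a rough sketch of that ranking (in Python, not taken from qtop itself):
assign_symbols is a hypothetical name, the job owner list stands in for the
owner column of qstat output, the alphabetical tie-break is my assumption,
and the sketch simply stops after the 62 symbols [0-9],[A-Z],[a-z].

```python
from collections import Counter
import string

# Symbol series from the text: digits, then upper case, then lower case.
SYMBOLS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def assign_symbols(job_owners):
    """Give "0" to the user with the most jobs, "1" to the next, and so on.

    `job_owners` holds one entry per job (hypothetical input format).
    Ties are broken alphabetically, which is an assumption.
    """
    counts = Counter(job_owners)
    ranked = sorted(counts, key=lambda user: (-counts[user], user))
    if len(ranked) > len(SYMBOLS):
        raise ValueError("more users than symbols in this sketch")
    return dict(zip(ranked, SYMBOLS))


# cms001 is a made-up account name for the example.
owners = ["atlasprd", "honeprd", "atlasprd", "cms001", "atlasprd", "honeprd"]
symbols = assign_symbols(owners)
print(symbols)
```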

Now, let's read an example from here:
https://twiki.cscs.ch/twiki/bin/view/DECH/QTOP
First notice the "Node state" line, which shows the first letter of each
node's state: j=job-exclusive, d=down etc.
Find ID 26 vertically; it is an "o"fflined node (notice the empty cores;
node 35 is also "o").
There you can see that atlasprd is "allocated" on CPU3, according to the
pbsnodes -a output.
That is actually an active torque bug which requires human intervention,
since that node had been rebooted and there is no chance a job is still
running there.
Yes, qtop blindly passes on information from the pbsnodes/qstat commands.

Also, try to answer the opposite question yourself: where is user honeprd's
job running?
That output can be pretty handy when you need to quickly chase a user job
within torque, or when a node is misbehaving and you need to find which
users are related to or impacted by it.
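
Both lookups amount to scanning the same node-by-core matrix in two
directions. A minimal Python sketch, assuming a hypothetical `allocation`
dict (node name -> list with one owner per core, None for an idle core);
the data below is invented and is not the twiki example:

```python
def cores_of_user(user, allocation):
    """Return (node, core index) pairs where `user` holds a core."""
    return [(node, core)
            for node, cores in allocation.items()
            for core, owner in enumerate(cores)
            if owner == user]


def users_on_node(node, allocation):
    """Return the sorted set of users with at least one core on `node`."""
    return sorted({owner for owner in allocation[node] if owner is not None})


# Hypothetical allocation: node name -> owner of each core (None = idle).
allocation = {
    "wn125": ["atlasprd", None, "honeprd", None],
    "wn126": [None, None, None, "atlasprd"],
}
where = cores_of_user("honeprd", allocation)
who = users_on_node("wn125", allocation)
print(where, who)
```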

Another way to use the tool is to try watch -d qtop, where you can see
resource (core) allocation/deallocation happening dynamically. Fancy and
useful when playing with maui/moab.
You also get a very descriptive picture when you have a black hole case.

I hope that was helpful.

btw.
If you want me to debug your system, let's go private; just send me the
output of pbsnodes -a/qstat -q/qstat, as described on the twiki.

Thank you for your questions; I will eventually turn them into an FAQ.

cheers,
Fotis

-- 
echo "sysadmin know better bash than english" | sed s/min/mins/ \
	| sed 's/better bash/bash better/' # Yelling in a CERN forum

