[torqueusers] Updated: killbaduser, a tool to clean up rogue
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Nov 1 12:55:56 MST 2006
Garrick Staples <garrick at clusterresources.com> wrote:
>> We've been using killbaduser, a tool to clean up rogue user processes,
>> for a while now and it seems to do the job well. I've made some
>> minor improvements to the bash script "killbaduser" version 1.3
>> (attached file, or available from ftp://ftp.fysik.dtu.dk/pub/PBS/).
>> This script should be executed on each individual Torque compute node,
>> either from a cron job, perhaps in the job prologue script (?), or from
>> the master server in a loop over all compute nodes.
> Why does it ask the server instead of just checking the local job files?
> That would seem much faster.
Good suggestion. I've looked at that now, but it doesn't seem feasible offhand.
On a node, the Torque job spool files are, for example:
# ls -1 /var/spool/torque/mom_priv/jobs/*.JB
So this node runs jobs 7471 and 7472, but qstat doesn't understand such
numeric job-IDs:
# qstat -f 7471
qstat: Unknown Job Id 7471.audhumbla1.dcsc.fysik.dtu.dk
In my case, the nodes see the pbs_server on a private network at the
address audhumbla1.dcsc.fysik.dtu.dk, whereas the official job-ID known
to the pbs_server would be 7471.audhumbla.fysik.dtu.dk.
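
To illustrate the naming problem, here is a minimal sketch of how a node
could list its local job numbers and build the full job-IDs that the
pbs_server would accept. It assumes that each spool file name starts with
the numeric job number followed by a dot, and that the official server
name is supplied by hand. The function names local_job_numbers and
full_job_id are hypothetical, not part of killbaduser or Torque.

```shell
#!/bin/bash
# Sketch only: derive numeric job numbers from the MOM spool directory.
# Assumption: job files are named <number>.<something>.JB

local_job_numbers() {
    # $1: spool directory (defaults to the path used in this thread)
    local jobdir="${1:-/var/spool/torque/mom_priv/jobs}"
    local f name
    for f in "$jobdir"/*.JB; do
        [ -e "$f" ] || continue          # unmatched glob stays literal; skip it
        name="${f##*/}"                  # strip the directory part
        name="${name%%.*}"               # keep the text before the first dot
        case "$name" in
            ''|*[!0-9]*) continue ;;     # ignore names that are not purely numeric
        esac
        printf '%s\n' "$name"
    done
}

full_job_id() {
    # $1: numeric job number, $2: official pbs_server host name
    printf '%s.%s\n' "$1" "$2"
}

# Possible use on a node (still asks the server, but with a job-ID it knows):
#   for n in $(local_job_numbers); do
#       qstat -f "$(full_job_id "$n" audhumbla.fysik.dtu.dk)"
#   done
```

This only works around the host name mismatch. It does not avoid the
round trip to the pbs_server.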
I don't see how we can read the required Torque server and job information
from a node without asking the pbs_server. For that to work locally, I
would need the following:
1. qstat should understand numeric job-IDs.
2. qstat on a node should print the "euser" variable.
Alternatively, it might be possible to extract the job-ID and user
information from the *.JB files (it's in there, I've looked at an octal dump).
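
As a rough sketch of that alternative, the printable text in a *.JB file
can be scanned for strings that look like a job-ID. This does not parse
the file; the binary layout is not described here, so the pattern below
is only an assumption about what a job-ID looks like (digits, a dot, then
a host name). The function name jb_jobid_candidates is hypothetical, and
the sketch makes no attempt to locate the user name.

```shell
#!/bin/bash
# Sketch only: print strings in a binary file that look like <number>.<host>.
# This is a text scan of the raw bytes, not a parser for the job file format.

jb_jobid_candidates() {
    # $1: path to a job file
    # Turn every run of non-printable bytes into a newline, then pick out
    # substrings shaped like "digits.hostname" and drop duplicates.
    LC_ALL=C tr -cs '[:print:]' '\n' < "$1" |
        LC_ALL=C grep -oE '[0-9]+\.[A-Za-z][A-Za-z0-9.-]*' |
        sort -u
}

# Possible use on a node:
#   for f in /var/spool/torque/mom_priv/jobs/*.JB; do
#       jb_jobid_candidates "$f"
#   done
```

Any match is only a candidate and would need checking against what the
pbs_server reports before it is trusted by a cleanup script.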