[torqueusers] Updated: killbaduser, a tool to clean up rogue user processes

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed Nov 1 12:55:56 MST 2006


Garrick Staples <garrick at clusterresources.com> wrote:
>> We've been using killbaduser, a tool to clean up rogue user processes,
>> for a while now and it seems to do the job well.  I've made some
>> minor improvements to the bash script "killbaduser" version 1.3
>> (attached file, or available from ftp://ftp.fysik.dtu.dk/pub/PBS/).
>>
>> This script should be executed on each individual Torque compute node,
>> either from a cron job, perhaps in the job prologue script (?), or from
>> the master server in a loop over all compute nodes.
> 
> 
> Why does it ask the server instead of just checking the local job files?
> That would seem much faster.

Good suggestion.  I've looked into it now, but it doesn't seem feasible offhand.
For example, the Torque job spool files on a node are:

# ls -1 /var/spool/torque/mom_priv/jobs/*.JB
/var/spool/torque/mom_priv/jobs/7471.audhum.JB
/var/spool/torque/mom_priv/jobs/7472.audhum.JB

So this node runs jobs 7471 and 7472, but qstat on the node cannot resolve
such bare numeric job-IDs:

# qstat -f 7471
qstat: Unknown Job Id 7471.audhumbla1.dcsc.fysik.dtu.dk

In my case, the nodes on this private network reach the pbs_server at the
address audhumbla1.dcsc.fysik.dtu.dk, whereas the job-ID known to the
pbs_server would be 7471.audhumbla.fysik.dtu.dk.
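
To make the mismatch concrete, here is a minimal sketch of how a script could
take the job numbers from the spool file names and append a server name that
the pbs_server recognizes.  The function names are made up for illustration,
and the server name is passed in by hand because I don't know a reliable way
to discover it on the node.

```shell
#!/bin/bash
# Sketch only: the function names are hypothetical, not part of killbaduser.
# Assumption: spool files are named <jobid>.JB, as in the listing above.

# list_local_jobids DIR
# Prints one job-ID prefix per line, e.g. "7471.audhum".
list_local_jobids() {
    local dir=$1 f
    for f in "$dir"/*.JB; do
        [ -e "$f" ] || continue      # no match: the glob stays literal, skip it
        f=${f##*/}                   # strip the directory part
        printf '%s\n' "${f%.JB}"     # strip the .JB suffix
    done
}

# full_jobid PREFIX SERVER
# Joins the numeric part of PREFIX with the server name given by the caller.
full_jobid() {
    printf '%s.%s\n' "${1%%.*}" "$2"
}
```

With the files shown above, "list_local_jobids /var/spool/torque/mom_priv/jobs"
would print 7471.audhum and 7472.audhum, and
"full_jobid 7471.audhum audhumbla.fysik.dtu.dk" would give the official form.
Whether qstat on a node then accepts that ID still depends on the local setup.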

I don't see how we can read the required Torque server and job information
on a node without asking the pbs_server.  To avoid that, I would need:

1. qstat to understand numeric job-IDs.
2. qstat on a node to print the "euser" variable.
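
If both points were met, picking out the job owner would be simple.  The
sketch below assumes that "qstat -f" prints attributes as indented
"name = value" lines; the function name job_euser is made up for this example.

```shell
#!/bin/bash
# Sketch only: reads "qstat -f" style text on stdin and prints the euser value.
# Assumption: attributes appear as "    name = value", one per line.
job_euser() {
    awk -F' = ' '$1 ~ /^[ \t]*euser$/ { print $2; exit }'
}
```

It would be used as "qstat -f <job-ID> | job_euser", with a job-ID that the
pbs_server accepts.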

Alternatively, it might be possible to extract the job-ID and user
information from the *.JB files (it's in there, I've looked at an octal dump).
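
As a rough starting point for that alternative, the readable strings in a
*.JB file can be listed without knowing the binary layout.  This is only a
sketch: the function name printable_runs is made up, and it makes no
assumption about where the job-ID or user name sits in the file.

```shell
#!/bin/bash
# Sketch only: prints runs of printable ASCII found in a binary file.
# printable_runs FILE [MINLEN]   (MINLEN defaults to 4 characters)
printable_runs() {
    # Turn every non-printable byte into a newline, then keep the long runs.
    LC_ALL=C tr -c '[:print:]' '\n' < "$1" |
        awk -v n="${2:-4}" 'length($0) >= n'
}
```

For example, "printable_runs /var/spool/torque/mom_priv/jobs/7471.audhum.JB"
lists candidate strings, which can then be searched with grep for the job
number or a user name.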

Thanks,
Ole

