[torqueusers] Updated: killbaduser, a tool to clean up rogue user processes

Garrick Staples garrick at usc.edu
Wed Nov 1 13:24:36 MST 2006


On Wed, Nov 01, 2006 at 08:55:56PM +0100, Ole Holm Nielsen alleged:
> Garrick Staples <garrick at clusterresources.com> wrote:
> >>> We've been using killbaduser, a tool to clean up rogue user processes,
> >>> for a while now and it seems to do the job well.  I've made some
> >>> minor improvements to the bash script "killbaduser" version 1.3
> >>> (attached file, or available from ftp://ftp.fysik.dtu.dk/pub/PBS/).
> >>> 
> >>> This script should be executed on each individual Torque compute node,
> >>> either from a cron job, perhaps in the job prologue script (?), or from
> >>> the master server in a loop over all compute nodes.
> >
> >
> >Why does it ask the server instead of just checking the local job files?
> >That would seem much faster.
> 
> Good suggestion.  I've looked at that now, but it doesn't seem feasible
> offhand.  The Torque job spool files on a node look like this, for example:
> 
> # ls -1 /var/spool/torque/mom_priv/jobs/*.JB
> /var/spool/torque/mom_priv/jobs/7471.audhum.JB
> /var/spool/torque/mom_priv/jobs/7472.audhum.JB
> 
> So this node runs jobs 7471 and 7472, but qstat doesn't understand such
> numeric job-IDs:
> 
> # qstat -f 7471
> qstat: Unknown Job Id 7471.audhumbla1.dcsc.fysik.dtu.dk
> 
> In my case, this private network has the pbs_server at the address
> audhumbla1.dcsc.fysik.dtu.dk whereas the pbs_server's official
> job-ID would be 7471.audhumbla.fysik.dtu.dk.

The @ syntax should work in this case:

  s=$(cat /var/spool/torque/server_name)
  qstat -f "7471@$s"
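
To apply the same idea to every spool file on a node, something like the
following might do.  This is only a sketch: jobid_at_server is a made-up
helper name, and it assumes the job sequence number is whatever precedes
the first dot in the *.JB file name, as in the listing quoted above.

```shell
#!/bin/sh
# Sketch only: build a "seq@server" job ID from a MOM spool file name.
# jobid_at_server is a hypothetical helper, not part of Torque.
jobid_at_server() {
    jb_path=$1                 # e.g. /var/spool/torque/mom_priv/jobs/7471.audhum.JB
    server=$2                  # e.g. the contents of /var/spool/torque/server_name
    base=${jb_path##*/}        # strip the directory part: 7471.audhum.JB
    seq=${base%%.*}            # assumed: sequence number precedes the first dot
    printf '%s@%s\n' "$seq" "$server"
}
```

The result could then be handed to qstat, for example
qstat -f "$(jobid_at_server "$f" "$(cat /var/spool/torque/server_name)")"
for each file $f.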

But that isn't necessary...

 
> I don't see how we can read the required Torque server and job information
> from a node without asking the pbs_server.  I would need that:
> 
> 1. qstat understands numeric job-IDs.
> 2. qstat on a node should print the "euser" variable.
> 
> Alternatively, it might be possible to extract the job-ID and user
> information from the *.JB files (it's in there, I've looked at an octal 
> dump).

Use 'printjob':

  printjob /var/spool/torque/mom_priv/jobs/7471.audhum.JB | grep euser

This is similar to what pam_pbssimpleauth.so does: it parses the JB
file to get this info.
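
A rough sketch of how a script could collect the job owners on a node
this way, without contacting the server.  It assumes printjob prints each
attribute on its own line as "name = value"; I haven't checked that
against every Torque version, so look at the real output first.  Both
function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumes printjob prints attributes as "name = value" lines;
# verify against the output of your Torque version before relying on it.
euser_from_printjob() {
    # Read printjob output on stdin; print the value of the first euser line.
    sed -n 's/^[[:space:]]*euser[[:space:]]*=[[:space:]]*//p' | head -n 1
}

# local_job_users is a hypothetical helper: list the distinct users owning
# jobs on this node, according to the *.JB files in the given directory.
local_job_users() {
    jobs_dir=${1:-/var/spool/torque/mom_priv/jobs}
    for jb in "$jobs_dir"/*.JB; do
        [ -e "$jb" ] || continue      # no job files: the glob stays unexpanded
        printjob "$jb" | euser_from_printjob
    done | sort -u
}
```

Calling local_job_users as root on a compute node should then give the
list of users allowed to have processes there; anything owned by another
non-system user is a candidate for cleanup.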

Hrm, it might be interesting to add some command-line args to printjob
to extract specific attributes.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California