[torqueusers] Updated: killbaduser, a tool to clean up rogue user processes
Garrick Staples
garrick at usc.edu
Wed Nov 1 13:24:36 MST 2006
On Wed, Nov 01, 2006 at 08:55:56PM +0100, Ole Holm Nielsen alleged:
> Garrick Staples <garrick at clusterresources.com> wrote:
> >>> We've been using killbaduser, a tool to clean up rogue user processes,
> >>> for a while now and it seems to do the job well. I've made some
> >>> minor improvements to the bash script "killbaduser" version 1.3
> >>> (attached file, or available from ftp://ftp.fysik.dtu.dk/pub/PBS/).
> >>>
> >>> This script should be executed on each individual Torque compute node,
> >>> either from a cron job, perhaps in the job prologue script (?), or from
> >>> the master server in a loop over all compute nodes.
> >
> >
> >Why does it ask the server instead of just checking the local job files?
> >That would seem much faster.
>
> Good suggestion. I've looked at that now, but it doesn't seem feasible
> offhand.
> The Torque job spool files are, for example on a node:
>
> # ls -1 /var/spool/torque/mom_priv/jobs/*.JB
> /var/spool/torque/mom_priv/jobs/7471.audhum.JB
> /var/spool/torque/mom_priv/jobs/7472.audhum.JB
>
> So this node runs jobs 7471 and 7472, but qstat doesn't understand such
> numeric job-IDs:
>
> # qstat -f 7471
> qstat: Unknown Job Id 7471.audhumbla1.dcsc.fysik.dtu.dk
>
> In my case, this private network has the pbs_server at the address
> audhumbla1.dcsc.fysik.dtu.dk whereas the pbs_server's official
> job-ID would be 7471.audhumbla.fysik.dtu.dk.

The @ syntax should work in this case:

  s=$(cat /var/spool/torque/server_name)
  7471@$s

But that isn't necessary...

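For reference, a minimal sketch of the @ form built from a job file name. The function name job_id_for_server is made up for illustration, and it assumes the sequence number is everything before the first dot in the .JB file name, as in the listing above:

```shell
# Hypothetical helper: turn a .JB file name into "<seqno>@<server>".
# Assumes the job sequence number is the part of the file name before
# the first dot (e.g. 7471 in 7471.audhum.JB).
job_id_for_server() (
    jbfile=$1
    server_file=${2:-/var/spool/torque/server_name}
    seqno=${jbfile##*/}     # strip the directory part
    seqno=${seqno%%.*}      # keep only the leading sequence number
    server=$(cat "$server_file") || exit 1
    printf '%s@%s\n' "$seqno" "$server"
)
```

On a node this could be used as: qstat -f "$(job_id_for_server /var/spool/torque/mom_priv/jobs/7471.audhum.JB)"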
> I don't see how we can read the required Torque server and job information
> from a node without asking the pbs_server. I would need the following:
>
> 1. qstat understands numeric job-IDs.
> 2. qstat on a node should print the "euser" variable.
>
> Alternatively, it might be possible to extract the job-ID and user
> information from the *.JB files (it's in there, I've looked at an octal
> dump).

Use 'printjob':

  printjob /var/spool/torque/mom_priv/jobs/7471.audhum.JB | grep euser

This is similar to what pam_pbssimpleauth.so does: it parses the JB
file to get this info.

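A rough sketch of how a script could loop over the local job files this way instead of asking the server. The function names are made up for illustration, and it assumes printjob prints the attribute on a line of the form "euser = name"; check the output of your printjob version before relying on that:

```shell
# Hypothetical helper: read printjob-style output on stdin and print the
# value of the first "euser = name" line (assumed output format).
extract_euser() {
    sed -n 's/^[[:space:]]*euser[[:space:]]*=[[:space:]]*//p' | head -n 1
}

# Hypothetical helper: print "<jobfile> <euser>" for each *.JB file
# in the given directory (default: the mom_priv/jobs spool directory).
list_job_users() (
    jobdir=${1:-/var/spool/torque/mom_priv/jobs}
    for jb in "$jobdir"/*.JB; do
        [ -e "$jb" ] || continue    # glob matched nothing: no jobs here
        printf '%s %s\n' "${jb##*/}" "$(printjob "$jb" | extract_euser)"
    done
)
```

Running list_job_users on a node would then give one line per job with its user, without any call to the pbs_server.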
Hrm, it might be interesting to add some command-line args to printjob
to extract specific attributes.
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California