[torqueusers] How to clean up rogue user processes ?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Thu Oct 5 09:46:03 MDT 2006


Thanks, Troy!  Not being a Perl programmer myself, I was inspired
by your script to write a similar bash script, "killbaduser" (attached,
or available from ftp://ftp.fysik.dtu.dk/pub/PBS/).  This script
should be executed on each individual Torque compute node: from a
cron job, perhaps from the job prologue script (?), or from the
master server in a loop over all compute nodes.  If anyone
else finds this script useful, I'd appreciate some feedback.
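
The last option mentioned above, a loop on the master server over all
compute nodes, might look roughly like the sketch below.  It is not part
of the attached script.  It assumes passwordless ssh to the nodes and
that killbaduser is installed under /usr/local/sbin on every node (a
hypothetical path); the function names and the REMOTE variable are
likewise made up for illustration.  Node names are taken from the
non-indented lines of "pbsnodes -a" output.

```shell
#!/bin/bash
# Sketch: run killbaduser on every compute node from the master server.
# Assumptions (not from the original post): passwordless ssh to the nodes,
# and the script installed at the hypothetical path below on each node.
KILLBADUSER=/usr/local/sbin/killbaduser
REMOTE=${REMOTE:-ssh}    # set REMOTE=echo for a dry run

# Node names are the non-indented, non-empty lines of "pbsnodes -a" output.
list_nodes() {
	awk '/^[^ \t]/ {print $1}'
}

# Read node names on stdin and run the script on each node in turn.
run_on_nodes() {
	while read -r node
	do
		# stdin is redirected so ssh cannot swallow the rest of the node list
		$REMOTE "$node" "$KILLBADUSER" "$@" < /dev/null
	done
}

# Typical use on the master:
#   pbsnodes -a | list_nodes | run_on_nodes -k
```

Because the nodes are visited one after another, the -s option should
not be needed in this mode; it is meant for the case where all nodes
start the script at the same time, e.g. from cron.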

Troy Baer wrote:
>> Does anyone have a script/tool for cleaning up user processes
>> on nodes in the case where the user ought not to have any processes
>> running on that node ?
> 
> Take a look at http://svn.osc.edu/repos/pbstools/trunk/sbin/reaver and
> see if it does what you need.

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark
-------------- next part --------------
#!/bin/bash

#
# On a Torque/PBS compute node, list and kill any user processes not belonging to batch jobs.
#
# Usage: killbaduser [-k] [-s]
#    -k will execute the kill command 
#    -s will sleep a random number of seconds so the pbs_server doesn't get overloaded
#

###  CONFIGURE:  ###
# The list of OK system user-ids:
USERLIST="root rpc rpcuser daemon ntp smmsp sshd hpsmh named dbus"
# Don't kill processes with UID < UIDMIN
UIDMIN=250

###  CONFIGURE:  ###
# Commands which we use:
PBSNODES=pbsnodes
QSTAT=qstat

#
# Process command options
#
DOKILL=0
DOSLEEP=0
while getopts "ks" options; do
	case $options in
		k ) DOKILL=1;;
		s ) DOSLEEP=1;;
		* ) echo "Usage: $0 [-k] [-s]"
			exit 1;;
	esac
done

# Get the Torque nodename for this node.
# Strip the domain name (would be nice if there existed a Torque function for the current nodename)
NODENAME=`echo $HOSTNAME | awk -F. '{print $1}'`
# echo $NODENAME

#
# Sleep a random number of seconds so Torque server doesn't get overloaded
# if all nodes run this script simultaneously.
#
if test ${DOSLEEP} -eq 1
then
	# Initialize /bin/bash built-in random number generator with PID
	RANDOM=$$
	MAXSLEEP=10
	INTERVAL=$((${RANDOM}%${MAXSLEEP}))
	# echo Sleep $INTERVAL
	sleep $INTERVAL
fi

#
# Get job list on this node and write one line for each unique job
#
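# (strip the "slot/" prefix of any length, e.g. "10/", and sort so that non-adjacent duplicates are removed)
JOBLIST=`$PBSNODES -a $NODENAME | grep 'jobs = ' | sed -e 's/,//g' -e 's/^ *jobs = //' -e 's/[0-9]*\///g' | tr ' ' '\n' | sort -u`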
# echo $JOBLIST

# get batch job user-ids and append to USERLIST
for job in $JOBLIST
do
	# Get the user-id from the Job_Owner attribute
	# (the "euser" variable seems to be unavailable on Torque compute nodes).
	EUSER=`$QSTAT -f $job | grep 'Job_Owner =' | awk '{print $3}' | awk -F@ '{print $1}'`
	# echo Job $job user $EUSER
	USERLIST="$USERLIST $EUSER"
done
# echo USERLIST: $USERLIST

#
# Get the process list and deselect acceptable user-ids.
#

ps --no-headers --deselect -u "$USERLIST" -o pid,state,uid,user,command

#
# Kill rogue user processes
#
if test ${DOKILL} -eq 1
then
	ps --no-headers --deselect -u "$USERLIST" -o pid,state,uid,user,command | awk -v UIDMIN=$UIDMIN '
	{
		PID=$1; UID=$3
		# UIDMIN is the awk variable set with -v (no "$", which would mean field number UIDMIN)
		if (UID >= UIDMIN) PIDLIST = PIDLIST sprintf("%d ", PID)
	} END {
		if (length(PIDLIST) > 0) {
			# Troy Baer safe version: SIGCONT; sleep; SIGTERM; sleep; SIGKILL
			system(sprintf("kill -s CONT %s", PIDLIST))
			system(sprintf("sleep 1; kill -s TERM %s", PIDLIST))
			system(sprintf("sleep 5; kill -s KILL %s", PIDLIST))
		}
	}'
fi
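
The job list step is the part of the script most sensitive to the exact
pbsnodes output format, so it can be useful to try it in isolation on
sample input.  The sketch below is a stand-alone variant, not the code
used in the script: it assumes a line of the form
"jobs = 0/1234.server, 1/1234.server, 10/1240.server", strips slot
numbers of any length, and uses "sort -u" so that duplicates are removed
even when they are not adjacent.  The function name unique_jobs is made
up for illustration.

```shell
#!/bin/bash
# Sketch: reduce the "jobs = ..." line of pbsnodes output to unique job ids.
# Assumed input format: "     jobs = 0/1234.server, 1/1234.server, 10/1240.server"
unique_jobs() {
	grep 'jobs = ' |
	sed -e 's/^ *jobs = //' -e 's/,//g' |   # drop the attribute name and the commas
	tr ' ' '\n' |                           # one slot/jobid entry per line
	sed -e 's/^[0-9]*\///' -e '/^$/d' |     # strip the "slot/" prefix and empty lines
	sort -u                                 # one line per unique job id
}

# Typical use on a compute node:
#   pbsnodes -a $NODENAME | unique_jobs
```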

