[torqueusers] Torque Monthly Usage Accounting

etienne gondet etienne.gondet at mercator-ocean.fr
Fri Jan 6 10:41:45 MST 2006


Hi,

Thanks for the clarification.

I said something inexact: pbsjobs does extract the exit status, but it
was not printing it to the JOBTEMP file.

This new pbsacct corrects the total cput time per user and indicates
the number of failed jobs (don't ask me why they failed).

I still like to see the cput, although I agree that walltime*nodect is
the better basis for accounting: on vector machines CPUT was
traditionally used, and it also shows how poor the CPU/walltime ratio
is, which means the application performs badly, probably because of I/O.
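
For illustration, here is a minimal sketch of that CPU/walltime ratio per
user. It assumes the pbsjobs record layout that pbsacct parses (user in
field 2, cput in field 6, walltime in field 9, total CPU count in
field 12). The function name cpu_efficiency is hypothetical and is not
part of the pbsacct package.

```shell
#!/bin/sh
# Hypothetical helper, not part of pbsacct: print "user percent" where
# percent = 100 * cput / (walltime * ncpus), summed over each user's jobs.
# Reads pbsjobs-style records on stdin (field positions as used by pbsacct).
cpu_efficiency() {
    awk '
    {
        user  = $2     # User name
        cput  = $6     # CPU time in seconds
        wall  = $9     # Wallclock time in seconds
        ncpus = $12    # Total number of CPUs used
        cpusecs[user]     += cput
        wallcpusecs[user] += wall * ncpus
    } END {
        for (user in cpusecs)
            if (wallcpusecs[user] > 0)
                printf("%s %.1f\n", user,
                    100 * cpusecs[user] / wallcpusecs[user])
    }'
}
```

It could be run as "cat accounting-files | pbsjobs | cpu_efficiency". A low
percentage suggests that the reserved CPUs sat idle, for example waiting on I/O.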

It is meant to be installed inside the standard pbsacct package.

    Etienne Gondet.

PS

Ole Holm Nielsen wrote:

> Hi Etienne,
>
> Thanks for your comments.  You're right about the multiplication of cput
> by nodect being incorrect.  The point was with PBSPro we didn't have the
> TM interface, so parallel MPI jobs would only count the master node
> CPU time (PBSPro), whereas with Torque+TM you get the correct
> total CPU time on all nodes *provided* that your application does
> use the TM interface !  Your modification is only valid under
> this assumption, which will not always be satisfied.
>
> That's why I've never bothered with accounting for CPU time, but
> only with wallclock time !  IMHO, it's fair to charge users for
> the walltime they reserve a certain number of nodes, rather than for
> their CPU time which may be the result of terribly inefficient
> use of the resources.
>
> Your second point about Exit_status is a good one.  Maybe you
> can propose a nice, compact and useful output format for
> pbsacct which includes failed jobs ?  If we agree on a good
> format, I could release a new version of the pbsacct tools.
>
> Best regards,
> Ole
>
> etienne gondet wrote:
>
>> I just tried pbsacct. It is just the simple tool I was looking for.
>>
>> I tried to add the total cumulated CPU time and I believe there is a
>> mistake in the CPU computation.
>>
>> In pbsjobs, cput is computed from the value of
>> resources_used.cput,
>> which is the total CPU time summed over all nodes and ppn? Can anybody
>> confirm this point?
>
> ...
>
>> #ETG modif for SBU = walltime*NCPUS
>>     # nodect = total_ncpus
>>     nodect = total_ncpus
>
> ...
>
>> #ETG    cpunodes[user] += nodect*cput
>>     cpunodes[user] += cput
>
> ...
>
>> #ETG    cpunodesecs += nodect*cput
>>     cpunodesecs += cput
>
>
>

-------------- next part --------------
#!/bin/sh

# Summarize USER accounting information from PBS accounting files
# located in $PBSHOME/server_priv/accounting/

# The accompanying script "pbsjobs" extracts simplified records
# of completed jobs.

# Usage: pbsacct <accounting-files>
# where <accounting-files> are daily PBS records (such as 20000705)
# Author:	Ole.H.Nielsen at fysik.dtu.dk
# Thanks to:	Miroslaw.Prywata at fuw.edu.pl

#---------------------------------------------------------------

#BINDIR=/usr/local/bin
BINDIR=/home/mercator/64/bin
GROUPID=""

if [ -z "$1" ] ; then
	echo "Usage: $0 [-g groupid] accounting-files";
	exit 1
fi

# Parse the optional "-g groupid" option
case $1 in
	-g) GROUPID=$2
	    shift; shift;
esac

# Accounting-files:
ACCT_FILES=$*
NUM_FILES=$#
# Sanity check
for f in ${ACCT_FILES}
do
	if [ ! -r $f ]
	then
		echo ERROR: File $f is unreadable:
		ls -la $f
		exit 1
	fi
done

# The pbsjobs accounting-information extractor script:
# May be set by an environment variable.
if [ -z "${PBSJOBS}" ] ; then
	PBSJOBS="${BINDIR}/pbsjobs";
fi
if [ ! -x "${PBSJOBS}" ] ; then
	echo No ${PBSJOBS} executable found
	exit 1
fi

# A working file
JOBTEMP=/tmp/pbsjobs.$$
# Trap error signals:
trap "rm -f ${JOBTEMP}; exit 2" 1 2 3 14 15 19

#---------------------------------------------------------------

# List the input files 
echo
echo "Portable Batch System USER accounting statistics"
echo "------------------------------------------------"
echo
echo A total of $NUM_FILES accounting files will be processed.

rm -f ${JOBTEMP}
cat ${ACCT_FILES} | ${PBSJOBS} > ${JOBTEMP}

cat ${JOBTEMP} | awk '
{
	if (NR == 1) firstdate=$7
	lastdate=$7
} END {
	printf("The first record is dated %s, last record is dated %s.\n",
		firstdate, lastdate)
}'

#---------------------------------------------------------------

echo
echo "                    jobs Number      Wallclock          Average Average  CPU"
echo "Username    Group   bad     total    hours      Percent  #nodes  q-days  hours"
echo "--------    -----   ------- -------- ---------  ------- ------- -------  -----"

cat ${JOBTEMP} | awk -v GROUPID="$GROUPID" '
{
	# Parse input data
	user	= $2		# User name
	group	= $3		# Group name
	queue	= $4		# Queue name
	nodect	= $5		# Number of nodes used
	cput	= $6		# CPU time in seconds
	wall	= $9		# Wallclock time in seconds
	wait	= $11		# Waiting time in seconds
	total_ncpus = $12	# Total number of CPUs used (>=nodect)
	status = $13		# Exit status of the job (non-zero = failed)

	#
	# Account by number of CPUs instead of number of nodes
	# (ETG modification: SBU = walltime*NCPUS).
	# Comment out the following line to account by nodes again:
	nodect = total_ncpus

	username[user] = user
	groupname[user] = group
	jobs[user]++
	if (status != 0) failedjobs[user]++ 
	if (status != 0) wrongjobs++ 
#ETG	cpunodes[user] += nodect*cput
	cpunodes[user] += cput
	wallnodes[user] += nodect*wall
	wallcpu[user] += wall
	if (nodect < minnodes[user]) minnodes[user] = nodect
	if (nodect > maxnodes[user]) maxnodes[user] = nodect
	waittime[user] += wait
	totaljobs++
	totalwait += wait
#ETG	cpunodesecs += nodect*cput
	cpunodesecs += cput
	wallnodesecs += nodect*wall
	wallsecs += wall
} END {
	cpunodedays = cpunodesecs / 86400
	wallnodedays = wallnodesecs / 86400
	walldays = wallsecs / 86400
	groupjobs = 0
	groupdays = 0
	for (user in username) {
		if (length(GROUPID) > 0 && groupname[user] != GROUPID) continue
		if (wallcpu[user] > 0)
			printf("%10s %8s %7d %7d  %8.2f  %6.2f %7.2f %7.2f %8.2f\n",
			username[user], groupname[user],  failedjobs[user], jobs[user], 
			wallnodes[user]/3600, wallnodes[user]/(864*wallnodedays),
			wallnodes[user]/wallcpu[user], waittime[user]/jobs[user]/86400,
			cpunodes[user]/3600)
		groupjobs += jobs[user]
		groupfailed += failedjobs[user]
		groupnodedays += wallnodes[user]/86400
		groupdays += wallcpu[user]/86400
		groupwait += waittime[user]
		groupcpu += cpunodes[user]
	}
	printf("%10s %8s %7d %7d  %8.2f  %6.2f %7.2f %7.2f %8.2f\n",
		"TOTAL", "-", wrongjobs, totaljobs, wallnodesecs/3600, 100,
		wallnodedays/walldays, totalwait/totaljobs/86400, cpunodesecs/3600)
	# The GROUP line uses the same columns and units (hours) as the lines above
	if (length(GROUPID) > 0 && groupjobs > 0)
		printf("%10s %8s %7d %7d  %8.2f  %6.2f %7.2f %7.2f %8.2f\n",
			"GROUP", GROUPID, groupfailed, groupjobs, groupnodedays*24,
			100*groupnodedays/wallnodedays,
			groupnodedays/groupdays, groupwait/groupjobs/86400,
			groupcpu/3600)
		
} ' | sort -r -n -k 4	# "-k 4" replaces the obsolete "+3" key syntax

# Remove the working file
rm -f ${JOBTEMP}
exit 0
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pbsjobs
Type: application/x-java-applet
Size: 3695 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20060106/238b387f/pbsjobs.bin

