[torqueusers] Torque Monthly Usage Accounting
etienne gondet
etienne.gondet at mercator-ocean.fr
Fri Jan 6 08:09:10 MST 2006
Yes, that would be useful for pointing out problematic users.
In the accounting record, an Exit_status different from 0 indicates that
something went wrong.
Exit status is not considered in pbsjobs and pbsacct, but it should be added,
with 2 columns: number of jobs and number of jobs with a non-zero Exit_status.
Etienne Gondet
PS: I will try to add that.
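
A minimal sketch of the two columns proposed above, working directly on the raw accounting files rather than on pbsjobs output. It assumes one record per line in the usual "date;type;jobid;key=value ..." layout; exit_status_summary is a hypothetical helper name, not part of pbsacct.

```shell
#!/bin/sh
# Sketch: per-user job count and count of jobs with non-zero Exit_status.
# Reads raw PBS/Torque accounting records on stdin.
# exit_status_summary is a hypothetical name, not part of pbsacct/pbsjobs.
exit_status_summary() {
    awk -F';' '
    $2 == "E" {                          # only job-end records carry Exit_status
        user = ""; status = ""
        n = split($4, kv, " ")           # key=value pairs are space separated
        for (i = 1; i <= n; i++) {
            if (kv[i] ~ /^user=/)        user   = substr(kv[i], 6)
            if (kv[i] ~ /^Exit_status=/) status = substr(kv[i], 13)
        }
        if (user == "") next
        jobs[user]++
        if (status != "" && status + 0 != 0) failed[user]++
    }
    END {
        # Output columns: user, number of jobs, jobs with non-zero Exit_status
        for (u in jobs) printf("%s %d %d\n", u, jobs[u], failed[u] + 0)
    }' | sort
}
```

It could be run as "cat 200601* | exit_status_summary", assuming the daily files are named YYYYMMDD.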
etienne gondet wrote:
>
> hello,
>
> I just gave pbsacct a try. It's just the easy tool I was looking for.
>
> I tried to add the total cumulated CPU time and I believe there is a
> mistake in the CPU computation.
>
> In pbsjobs, cput is computed from the value of resources_used.cput,
> which is the total CPU time summed over all nodes and ppn? Can anybody
> confirm this point?
>
>                        Wallclock          Average Average       CPU
> Username Group  #jobs      hours  Percent  #nodes  q-days     hours
> -------- -----  -----  ---------  ------- ------- -------     -----
> TOTAL    -       1876    8248.34   100.00    4.41    0.00   3017.38
> user1    red      745    3538.88    42.90    6.00    0.00   1229.49
> user2    red      285    2382.64    28.89    2.99    0.00   1103.90
>
> But in pbsacct the CPU time is multiplied again by the number of nodes:
> line 108 cpunodes[user] += nodect*cput
> line 116 cpunodesecs += nodect*cput
>
> So I guess the following would be more accurate:
> line 108 cpunodes[user] += cput
> line 116 cpunodesecs += cput
>
>
> If I look at an accounting record, resources_used.cput=01:41:36 is
> greater than resources_used.walltime=00:51:22.
> That's why I think it is already the cumulated CPU time over all the
> processors (nodes*ppn).
>
> 01/05/2006 02:18:34;E;20020.baltic;user=mbenkiran group=mercator
> jobname=SAM1V2_UV queue=long ctime=1136424430 qtime=1136424431
> etime=1136424431 start=1136424432
> exec_host=baltic-05/1+baltic-05/0+baltic-04/1+baltic-04/0+baltic-03/1+baltic-03/0
> Resource_List.cput=12:30:00 Resource_List.neednodes=3:ppn=2
> Resource_List.nodect=3 Resource_List.nodes=3:ppn=2
> Resource_List.pcput=03:00:00 Resource_List.pmem=5888mb
> Resource_List.pvmem=5888mb Resource_List.walltime=03:00:00 session=0
> end=1136427514 Exit_status=0 resources_used.cput=01:41:36
> resources_used.mem=4095000kb resources_used.vmem=3003072kb
> resources_used.walltime=00:51:22
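
The comparison above can be checked with a few lines of shell. This is only a sketch: hms_to_secs and cput_ratio are hypothetical helper names, and the CPU count of 6 is taken from nodes=3:ppn=2 in the record above.

```shell
#!/bin/sh
# hms_to_secs: convert HH:MM:SS (hours may exceed 24) to seconds.
hms_to_secs() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# cput_ratio CPUT WALLTIME NCPUS
# Prints cput/walltime and cput/(walltime*ncpus).
cput_ratio() {
    c=$(hms_to_secs "$1")
    w=$(hms_to_secs "$2")
    awk -v c="$c" -v w="$w" -v n="$3" 'BEGIN {
        if (w <= 0 || n <= 0) exit 1      # avoid division by zero
        printf("%.2f %.2f\n", c / w, c / (w * n))
    }'
}

# Values from the record above (3 nodes x 2 ppn = 6 CPUs):
cput_ratio 01:41:36 00:51:22 6    # prints "1.98 0.33"
```

A first ratio above 1 means cput cannot be the CPU time of a single processor, so multiplying it by nodect again would count it more than once.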
>
> Happy new year to all Torque users.
>
> Ole Holm Nielsen wrote:
>
>> hpc.group at gmail.com wrote:
>>
>>> Does anyone know how to generate an accurate Torque monthly usage report
>>> based on the number of CPUs, not the number of nodes, for cluster and SMP
>>> machines? The report would include userid, group, wall-clock (hours),
>>> CPU time (hours) and number of CPUs. Please let me know, thanks.
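
One possible way to get such a CPU-based report straight from the raw accounting files is sketched below. It assumes one record per line and that each "+"-separated entry in exec_host is one CPU slot; usage_by_cpus is a hypothetical name, not part of any existing package.

```shell
#!/bin/sh
# Sketch: per-user usage weighted by number of CPUs (not nodes).
# Reads raw PBS/Torque accounting records on stdin.
# Output columns: user, group, #jobs, wallclock*ncpus hours, CPU hours.
usage_by_cpus() {
    awk -F';' '
    # Convert HH:MM:SS to seconds
    function secs(hms,   t) { split(hms, t, ":"); return t[1]*3600 + t[2]*60 + t[3] }
    $2 == "E" {                               # job-end records only
        user = ""; group = ""; host = ""; wall = 0; cput = 0
        n = split($4, kv, " ")
        for (i = 1; i <= n; i++) {
            eq = index(kv[i], "=")
            if (eq == 0) continue
            key = substr(kv[i], 1, eq - 1)
            val = substr(kv[i], eq + 1)
            if (key == "user")                    user  = val
            if (key == "group")                   group = val
            if (key == "exec_host")               host  = val
            if (key == "resources_used.walltime") wall  = secs(val)
            if (key == "resources_used.cput")     cput  = secs(val)
        }
        if (user == "") next
        ncpus = split(host, slots, "[+]")     # one entry per CPU slot
        jobs[user]++
        grp[user]      = group
        wallcpu[user] += wall * ncpus
        cpu[user]     += cput                 # cput is not multiplied again
    }
    END {
        for (u in jobs)
            printf("%s %s %d %.2f %.2f\n", u, grp[u], jobs[u],
                   wallcpu[u] / 3600, cpu[u] / 3600)
    }' | sort
}
```

For a monthly report it could be fed with something like "cat 200601* | usage_by_cpus", assuming daily files named YYYYMMDD.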
>>
>>
>>
>> I wrote some really simple accounting scripts for PBS (Torque and PBSPro)
>> some years ago, and this is what we still use. You may download the
>> pbsacct package from ftp://ftp.fysik.dtu.dk/pub/PBS/
>>
>> Regards,
>> Ole
>>
>
>------------------------------------------------------------------------
>
>#!/bin/sh
>
># Summarize USER accounting information from PBS accounting files
># located in $PBSHOME/server_priv/accounting/
>
># The accompanying script "pbsjobs" extracts simplified records
># of completed jobs.
>
># Usage: pbsacct <accounting-files>
># where <accounting-files> are daily PBS records (such as 20000705)
># Author: Ole.H.Nielsen at fysik.dtu.dk
># Thanks to: Miroslaw.Prywata at fuw.edu.pl
>
>#---------------------------------------------------------------
>
>#BINDIR=/usr/local/bin
>BINDIR=/home/mercator/64/bin
>GROUPID=""
>
>if [ -z "$1" ] ; then
> echo "Usage: $0 [-g groupid] accounting-files";
> exit 1
>fi
>
>#
>case $1 in
> -g) GROUPID=$2
> shift; shift;
>esac
>
># Accounting-files:
>ACCT_FILES=$*
>NUM_FILES=$#
># Sanity check
>for f in ${ACCT_FILES}
>do
> if [ ! -r $f ]
> then
> echo ERROR: File $f is unreadable:
> ls -la $f
> exit 1
> fi
>done
>
># The pbsjobs accounting-information extractor script:
># May be set by an environment variable.
>if [ -z "${PBSJOBS}" ] ; then
> PBSJOBS="${BINDIR}/pbsjobs";
>fi
>if [ ! -x "${PBSJOBS}" ] ; then
> echo No ${PBSJOBS} executable found
> exit 1
>fi
>
># A working file
>JOBTEMP=/tmp/pbsjobs.$$
># Trap error signals:
>trap "rm -f ${JOBTEMP}; exit 2" 1 2 3 14 15 19
>
>#---------------------------------------------------------------
>
># List the input files
>echo
>echo "Portable Batch System USER accounting statistics"
>echo "------------------------------------------------"
>echo
>echo A total of $NUM_FILES accounting files will be processed.
>
>rm -f ${JOBTEMP}
>cat ${ACCT_FILES} | ${PBSJOBS} > ${JOBTEMP}
>
>cat ${JOBTEMP} | awk '
>{
> if (NR == 1) firstdate=$7
> lastdate=$7
>} END {
> printf("The first record is dated %s, last record is dated %s.\n",
> firstdate, lastdate)
>}'
>
>#---------------------------------------------------------------
>
>echo
>echo " Wallclock Average Average CPU"
>echo "Username Group #jobs hours Percent #nodes q-days hours"
>echo "-------- ----- ----- --------- ------- ------- ------- -----"
>
>cat ${JOBTEMP} | awk -vGROUPID=$GROUPID '
>{
> # Parse input data
> user = $2 # User name
> group = $3 # Group name
> queue = $4 # Queue name
> nodect = $5 # Number of nodes used
> cput = $6 # CPU time in seconds
> wall = $9 # Wallclock time in seconds
> wait = $11 # Waiting time in seconds
> total_ncpus = $12 # Total number of CPUs used (>=nodect)
>
> #
> # For accounting by number of CPUs instead of number of nodes:
> # Uncomment the following line:
>#ETG modif for SBU = walltime*NCPUS
> # nodect = total_ncpus
> nodect = total_ncpus
>
> username[user] = user
> groupname[user] = group
> jobs[user]++
>#ETG cpunodes[user] += nodect*cput
> cpunodes[user] += cput
> wallnodes[user] += nodect*wall
> wallcpu[user] += wall
> if (nodect < minnodes[user]) minnodes[user] = nodect
> if (nodect > maxnodes[user]) maxnodes[user] = nodect
> waittime[user] += wait
> totaljobs++
> totalwait += wait
>#ETG cpunodesecs += nodect*cput
> cpunodesecs += cput
> wallnodesecs += nodect*wall
> wallsecs += wall
>} END {
> cpunodedays = cpunodesecs / 86400
> wallnodedays = wallnodesecs / 86400
> walldays = wallsecs / 86400
> groupjobs = 0
> groupdays = 0
> for (user in username) {
> if (length(GROUPID) > 0 && groupname[user] != GROUPID) continue
> if (wallcpu[user] > 0)
> printf("%10s %8s %7d %8.2f %6.2f %7.2f %7.2f %8.2f\n",
> username[user], groupname[user], jobs[user],
> wallnodes[user]/3600, wallnodes[user]/(864*wallnodedays),
> wallnodes[user]/wallcpu[user], waittime[user]/jobs[user]/86400,
> cpunodes[user]/3600)
> groupjobs += jobs[user]
> groupnodedays += wallnodes[user]/86400
> groupdays += wallcpu[user]/86400
> groupwait += waittime[user]
> }
> printf("%10s %8s %7d %8.2f %6.2f %7.2f %7.2f %8.2f\n",
> "TOTAL", "-", totaljobs, wallnodesecs/3600, 100,
> wallnodedays/walldays, totalwait/totaljobs/86400, cpunodesecs/3600)
> if (length(GROUPID) > 0 && groupjobs > 0)
> printf("%10s %8s %7d %8.2f %7.2f %7.2f %7.2f \n",
> "GROUP", GROUPID, groupjobs, groupnodedays,
> 100*groupnodedays/wallnodedays,
> groupnodedays/groupdays, groupwait/groupjobs/86400)
>
>} ' | sort -r -n -k 4
>
>rm -f ${JOBTEMP}
>exit 0
>
>