[torqueusers] Wrong resources_used.cput report
Mathieu OUDART
Mathieu.Oudart at cnes.fr
Thu Sep 1 05:55:44 MDT 2005
Hi,
Here is some additional information about this problem:
$ cat test.pbs
#!/bin/bash
cd $HOME
# 1800 seconds of CPU Load and 800MB of memory allocated
./okuseq -t 1800 -m 800
exit 0
$ qsub test.pbs
113.linux-ci
$ qstat -n
linux-ci:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
113.linux-ci    gen_user q_gen_4h test.pbs    28759   1  --    1gb 01:00 R    --
   l14-ci/0
$ pbsnodes -a l14-ci
l14-ci
state = free
np = 2
ntype = cluster
jobs = 0/113.linux-ci
status = arch=linux,uname=Linux l14-ci 2.6.11-6mdksmp #1 SMP Tue Mar 22 10:13:47 EST 2005 x86_64,sessions=28759,nsessions=1,nusers=1,idletime=12698,totmem=6148844kb,availmem=5088968kb,physmem=4060436kb,ncpus=2,loadave=0.96,netload=206810862,size=40210896kb:40243728kb,state=free,rectime=1125573966
$ checkjob -v 113
checking job 113 (RM job '113.linux-ci')
State: Running
Creds: user:gen_user2 group:nobody class:q_gen_4h qos:DEFAULT
WallTime: 00:03:57 of 8:00:00
SubmitTime: Thu Sep 1 13:22:38
(Time Queued Total: 00:00:02 Eligible: 00:00:02)
StartTime: Thu Sep 1 13:22:40
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 5120M Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 1024M DISK: 5120M
Utilized Resources Per Task: MEM: 8.03 SWAP: 8.19
Avg Util Resources Per Task: PROCS: 0.00
Max Util Resources Per Task: MEM: 8.03 SWAP: 8.19
Average Utilized Memory: 735.30 MB
Average Utilized Procs: 0.92
NodeAccess: SHARED
NodeCount: 1
Allocated Nodes:
[l14-ci:1]
Task Distribution: l14-ci
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '113' (-00:03:37 -> 7:56:23 Duration: 8:00:00)
PE: 129.38 StartPriority: 1
$ qstat -f 113
Job Id: 113.linux-ci
Job_Name = test.pbs
Job_Owner = gen_user2 at l13-ci
resources_used.cput = 00:00:00
resources_used.mem = 822972kb
resources_used.vmem = 839112kb
resources_used.walltime = 00:02:24
job_state = R
queue = q_gen_4h
server = linux-ci
Checkpoint = u
ctime = Thu Sep 1 13:22:38 2005
Error_Path = l13-ci:/Produits/tmp/torque/gen_user2/test.pbs.e113
exec_host = l14-ci/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = e
mtime = Thu Sep 1 13:22:40 2005
Output_Path = l13-ci:/Produits/tmp/torque/gen_user2/test.pbs.o113
Priority = 0
qtime = Thu Sep 1 13:22:38 2005
Rerunable = True
Resource_List.cput = 01:00:00
Resource_List.file = 5gb
Resource_List.mem = 1gb
Resource_List.nodect = 1
Resource_List.walltime = 08:00:00
session_id = 28759
Variable_List = PBS_O_HOME=/Produits/tmp/torque/gen_user2,PBS_O_LANG=fr_FR,
PBS_O_LOGNAME=gen_user2,
PBS_O_PATH=/Produits/publics/x86_64.Linux.2.6.11/bin:/bin:/usr/bin:/us
r/X11R6/bin:/Produits/publics/x86_64.Linux.2.6.11/torque/1.2.0p5/bin:/u
sr/lib64/jdk-1.4.2/bin:/Produits/tmp/torque/gen_user2/bin,
PBS_O_MAIL=/var/spool/mail/gen_user2,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=l13-ci,PBS_O_WORKDIR=/Produits/tmp/torque/gen_user2,
LESSKEY=/etc/.less,LC_PAPER=fr_FR,
MANPATH=/Produits/publics/x86_64.Linux.2.6.11/torque/1.2.0p5/man:/Prod
uits/publics/x86_64.Linux.2.6.11/man:/usr/share/man/fr:/usr/share/man:/
usr/X11R6/man/fr:/usr/X11R6/man:/usr/lib64/jdk-1.4.2/man,
PUBLICS=/Produits/publics/x86_64.Linux.2.6.11,LC_ADDRESS=fr_FR,
HOSTNAME=l13-ci,LC_MONETARY=fr_FR,TERM=xterm,SHELL=/bin/bash,
LC_SOURCED=1,HISTSIZE=1000,PBS_JOBNAME=shell.pbs,
PBS_ENVIRONMENT=PBS_INTERACTIVE,LC_NUMERIC=fr_FR,QTDIR=/usr/lib/qt3/,
OS=x86_64.Linux.2.6.11,PBS_O_WORKDIR=/Produits/tmp/torque/gen_user2,
PBS_TASKNUM=1,USER=gen_user2,
LD_LIBRARY_PATH=/Produits/publics/x86_64.Linux.2.6.11/lib,
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:
cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01
;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01
;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;
31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*
.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35
:*.png=01;35:*.tif=01;35:,LC_TELEPHONE=fr_FR,
FLASH_GTK_LIBRARY=libgtk-x11-2.0.so.0,
PBS_O_HOME=/Produits/tmp/torque/gen_user2,
DATACI=/data/nobody/gen_user2,PBS_MOMPORT=15003,
SCREENDIR=/Produits/tmp/torque/gen_user2/tmp,
MOZ_PLUGIN_PATH=/usr/lib/mozilla/plugins:/Produits/tmp/torque/gen_user
2/.mozilla/plugins,XDG_CONFIG_DIRS=/var/lib/menu-xdg,
PBS_O_QUEUE=q_gen_shell,NLSPATH=/usr/share/locale/%l/%N,
MAIL=/var/spool/mail/gen_user2,PBS_O_LOGNAME=gen_user2,
PATH=/Produits/publics/x86_64.Linux.2.6.11/bin:/bin:/usr/bin:/usr/X11R
6/bin:/Produits/publics/x86_64.Linux.2.6.11/torque/1.2.0p5/bin:/usr/lib
64/jdk-1.4.2/bin:/Produits/tmp/torque/gen_user2/bin,LC_MESSAGES=fr_FR,
PBS_O_LANG=fr_FR,PBS_JOBCOOKIE=6183D7E18D158F2FE4AC6C106F413739,
SECURE_LEVEL=2,LC_IDENTIFICATION=fr_FR,LC_COLLATE=fr_FR,
INPUTRC=/etc/inputrc,PWD=/Produits/tmp/torque/gen_user2,
JAVA_HOME=/usr/lib64/jdk-1.4.2,PBS_NODENUM=0,LANG=fr_FR,
PYTHONSTARTUP=/etc/pythonrc.py,LC_MEASUREMENT=fr_FR,
XFILESEARCHPATH=/Produits/publics/x86_64.Linux.2.6.11/%T/%N%S,
PS1=[\\u@\\h \\W]\\$,PBS_O_SHELL=/bin/bash,PBS_JOBID=107.linux-ci,
HISTCONTROL=ignoredups,JDK_HOME=/usr/lib64/jdk-1.4.2,SHLVL=1,
HOME=/Produits/tmp/torque/gen_user2,LANGUAGE=fr_FR:fr,
GCONF_TMPDIR=/tmp,PBS_O_HOST=l13-ci,LESS=-MM,
G_FILENAME_ENCODING=@locale,LOGNAME=gen_user2,
XDG_DATA_DIRS=/var/lib/menu-xdg:/usr/share,PBS_QUEUE=q_gen_shell,
LC_CTYPE=fr_FR,TMPCI=/tmp,LESSOPEN=|/usr/bin/lesspipe.sh %s,
PBS_O_MAIL=/var/spool/mail/gen_user2,
INFOPATH=/Produits/publics/x86_64.Linux.2.6.11/info,DISPLAY=:0,
LC_TIME=fr_FR,PBS_NODEFILE=/var/spool/torque/1.2.0p5/aux/107.linux-ci,
XAUTHORITY=/Produits/tmp/torque/gen_user2/.Xauthority,LC_NAME=fr_FR,
PBS_O_PATH=/usr/local/home/sosiatis/cluster_build/torque-srv/sbin:/usr
/local/home/sosiatis/cluster_build/torque-client/sbin:/usr/local/home/s
osiatis/cluster_build/torque-client/bin:/usr/local/home/sosiatis/cluste
r_build/maui/sbin:/usr/local/home/sosiatis/cluster_build/maui/bin:/usr/
local/home/sosiatis/cluster_build/torque-mom/sbin:/usr/local/bin:/bin:/
usr/bin:/usr/X11R6/bin:/usr/games:/usr/lib64/jre-1.4.2/bin,
_=/Produits/publics/x86_64.Linux.2.6.11/torque/1.2.0p5/bin/qsub,
PBS_O_QUEUE=q_gen_batch
etime = Thu Sep 1 13:22:38 2005
#
# ON THE MOM NODE :
#
# top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28829 gen_user 25 0 802m 800m 364 R 99.8 20.2 4:44.42 okuseq
# momctl -d 3
Host: l14-ci/l14-ci Server: linux-ci Version: torque_1.2.0p5
PID: 28635
HomeDirectory: /var/spool/torque/1.2.0p5/mom_priv
MOM active: 6600 seconds
Last Msg From Server: 30 seconds (StatusJob)
Last Msg To Server: 10 seconds
Server Update Interval: 20 seconds
Init Msgs Received: 0 hellos/4 cluster-addrs
Init Msgs Sent: 4 hellos
LOGLEVEL: 4 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
TCP Timeout: 20 seconds
Prolog Alarm Time: 300 seconds
Alarm Time: 0 of 10 seconds
Trusted Client List: 132.149.11.217,132.149.11.213,132.149.11.207,132.149.11.202,132.149.11.220,132.149.11.219,132.149.11.218,127.0.0.1
Job[113.linux-ci] State=RUNNING
diagnostics complete
#
# End of job
#
- Standard output :
+----------------------------------------
| Job ID : 113.linux-ci
| Username : gen_user2
| Group : nobody
| Job name : test.pbs
| Batch Queue : q_gen_4h
| Resources : cput=01:00:00,file=5gb,mem=1gb,neednodes=l14-ci,walltime=08:00:00
| Account :
+----------------------------------------
version : okuseq 2.0s
temps : 1 x 1800 = 1800 s
memoire : 1 x 800 = 800 Mo
MP_PROCS : 1
OMP_NUM_THREADS : (null)
0 : Debut
Temps reel : 1800.000000
Temps cpu user : 1798.970000
Temps cpu system : 0.660000
0 : Fin par timer
+----------------------------------------
| RESOURCES SUMMARY
+----------------------------------------
| Job ID : 113.linux-ci
| Job name : test.pbs
| Batch Queue : q_gen_4h
| Resources :
| Required : cput=01:00:00,file=5gb,mem=1gb,neednodes=l14-ci,walltime=08:00:00
| Used : cput=00:00:00,mem=822972kb,vmem=839112kb,walltime=00:30:02
+----------------------------------------
- End mail :
From torque at linux-ci.cst.cnes.fr Thu Sep 1 13:52:47 2005
X-Original-To: gen_user2 at l13-ci.cst.cnes.fr
Delivered-To: gen_user2 at l13-ci.cst.cnes.fr
To: gen_user2 at l13-ci.cst.cnes.fr
Subject: PBS JOB 113.linux-ci
Precedence: bulk
Date: Thu, 1 Sep 2005 13:52:41 +0200 (CEST)
From: torque at linux-ci.cst.cnes.fr (root)
PBS Job Id: 113.linux-ci
Job Name: test.pbs
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=822972kb
resources_used.vmem=839112kb
resources_used.walltime=00:30:02
##############
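As you can see, the job runs correctly and everything looks fine except the CPU time accounting: okuseq itself reports about 1800 s of CPU time, yet resources_used.cput stays at 00:00:00.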
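
To make the mismatch explicit, the accounting values can be converted to seconds and compared. The following is only a minimal sketch: the helper names hms_to_seconds and used_resource are hypothetical, and the sample text is copied from the end-of-job mail above rather than read from a live server.

```shell
#!/bin/bash
# Convert an HH:MM:SS string, as printed in resources_used.*, to seconds.
# awk does decimal arithmetic, so a field such as "08" is not read as octal.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the value of resources_used.<name> from text on stdin.
# Accepts both "name = value" (qstat -f) and "name=value" (end-of-job mail).
used_resource() {
    sed -n "s/^[[:space:]]*resources_used\.$1 *= *//p"
}

# Sample copied from the end-of-job mail for job 113; on a live system one
# would pipe the output of "qstat -f <jobid>" into used_resource instead.
sample='resources_used.cput=00:00:00
resources_used.walltime=00:30:02'

cput=$(hms_to_seconds "$(printf '%s\n' "$sample" | used_resource cput)")
wall=$(hms_to_seconds "$(printf '%s\n' "$sample" | used_resource walltime)")
echo "cput=${cput}s walltime=${wall}s"

# A job that ran for a while but accumulated no CPU time at all is suspect.
if [ "$cput" -eq 0 ] && [ "$wall" -gt 0 ]; then
    echo "suspicious: job ran for ${wall}s but cput is 0"
fi
```

For job 113 this gives 0 s of reported CPU time against 1802 s of walltime, while okuseq's own counters add up to 1798.97 + 0.66 = 1799.63 s.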
There is nothing unusual in the MOM or server logs, so I don't know where to look next.
Any help would be appreciated!
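
One possible cross-check on the MOM node is to read the kernel's own per-process counters for the job's session (session_id 28759 above) from /proc and see whether they are non-zero. This is a sketch only: the function name session_cpu_seconds is hypothetical, and it is not a statement about how pbs_mom itself computes cput. It relies on the field layout of /proc/<pid>/stat documented in proc(5).

```shell
#!/bin/bash
# Sum user + system CPU seconds of every live process in a given session,
# read from /proc/<pid>/stat (Linux only). CPU time of children that have
# already exited and been reaped is not included.
session_cpu_seconds() {
    sid=$1
    ticks=$(getconf CLK_TCK)            # clock ticks per second
    # Processes may exit while we read; ignore the resulting errors.
    cat /proc/[0-9]*/stat 2>/dev/null |
    # Drop "pid (comm) " so spaces in the command name cannot shift fields.
    sed 's/^.*) //' |
    # After stripping: $4 = session, $12 = utime, $13 = stime (in ticks).
    awk -v sid="$sid" -v hz="$ticks" '
        $4 == sid { total += $12 + $13 }
        END { printf "%d\n", total / hz }'
}

# Demo on this shell's own session; on the MOM node one would pass the
# session_id reported by "qstat -f" for the job instead.
own_sid=$(sed 's/^.*) //' "/proc/$$/stat" | awk '{ print $4 }')
echo "session $own_sid: $(session_cpu_seconds "$own_sid") CPU seconds"
```

If this reports a plausible value for the job's session while resources_used.cput remains at zero, the kernel counters are fine and the problem lies somewhere between the MOM's sampling and the server.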
Regards.
> Hi all,
>
> We currently use Torque 1.2.0p5 / Maui 3.2.6p14 (snap) on our Linux cluster. All nodes have the same Opteron-based architecture and run Mandriva LE 2005.
>
> The problem is that the CPU time used by batch jobs is not reported correctly (it is always zero):
> cput=00:00:00,mem=4256kb,vmem=24428kb,walltime=00:01:01
>
> This is not only a reporting problem: my CPU-time-based policies do not work either.
>
> For example, I set "RESOURCELIMITPOLICY PROC:ALWAYS:CANCEL" in my Maui configuration,
> but a job requesting 10 CPU minutes can use several CPU hours without being cancelled.
>
> Has anyone experienced this problem with resources_used.cput?
>
> Regards.
>
> --
> Mathieu OUDART
>