[torqueusers] Wrong resources_used.cput report

Mathieu OUDART Mathieu.Oudart at cnes.fr
Thu Sep 1 05:55:44 MDT 2005


Hi,


Here is some additional information about this problem:

$ cat test.pbs
#!/bin/bash

cd $HOME
# 1800 seconds of CPU Load and 800MB of memory allocated
./okuseq -t 1800 -m 800

exit 0

$ qsub test.pbs
113.linux-ci

$ qstat -n

linux-ci:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
113.linux-ci    gen_user q_gen_4h test.pbs    28759   1  --    1gb 01:00 R   --
   l14-ci/0

$ pbsnodes -a l14-ci
l14-ci
     state = free
     np = 2
     ntype = cluster
     jobs = 0/113.linux-ci
     status = arch=linux,uname=Linux l14-ci 2.6.11-6mdksmp #1 SMP Tue Mar 22 10:13:47 EST 2005 x86_64,sessions=28759,nsessions=1,nusers=1,idletime=12698,totmem=6148844kb,availmem=5088968kb,physmem=4060436kb,ncpus=2,loadave=0.96,netload=206810862,size=40210896kb:40243728kb,state=free,rectime=1125573966


$ checkjob -v 113


checking job 113 (RM job '113.linux-ci')

State: Running
Creds:  user:gen_user2  group:nobody  class:q_gen_4h  qos:DEFAULT
WallTime: 00:03:57 of 8:00:00
SubmitTime: Thu Sep  1 13:22:38
  (Time Queued  Total: 00:00:02  Eligible: 00:00:02)

StartTime: Thu Sep  1 13:22:40
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 5120M  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1  MEM: 1024M  DISK: 5120M
Utilized Resources Per Task:  MEM: 8.03  SWAP: 8.19
Avg Util Resources Per Task:  PROCS: 0.00
Max Util Resources Per Task:  MEM: 8.03  SWAP: 8.19
Average Utilized Memory: 735.30 MB
Average Utilized Procs: 0.92
NodeAccess: SHARED
NodeCount: 1
Allocated Nodes:
[l14-ci:1]
Task Distribution: l14-ci


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '113' (-00:03:37 -> 7:56:23  Duration: 8:00:00)
PE:  129.38  StartPriority:  1

$ qstat -f 113
Job Id: 113.linux-ci
    Job_Name = test.pbs
    Job_Owner = gen_user2 at l13-ci
    resources_used.cput = 00:00:00
    resources_used.mem = 822972kb
    resources_used.vmem = 839112kb
    resources_used.walltime = 00:02:24
    job_state = R
    queue = q_gen_4h
    server = linux-ci
    Checkpoint = u
    ctime = Thu Sep  1 13:22:38 2005
    Error_Path = l13-ci:/Produits/tmp/torque/gen_user2/test.pbs.e113
    exec_host = l14-ci/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = e
    mtime = Thu Sep  1 13:22:40 2005
    Output_Path = l13-ci:/Produits/tmp/torque/gen_user2/test.pbs.o113
    Priority = 0
    qtime = Thu Sep  1 13:22:38 2005
    Rerunable = True
    Resource_List.cput = 01:00:00
    Resource_List.file = 5gb
    Resource_List.mem = 1gb
    Resource_List.nodect = 1
    Resource_List.walltime = 08:00:00
    session_id = 28759
    Variable_List = PBS_O_HOME=/Produits/tmp/torque/gen_user2,PBS_O_LANG=fr_FR,
        PBS_O_LOGNAME=gen_user2,
        PBS_O_PATH=/Produits/publics/x86_64.Linux.2.6.11/bin:/bin:/usr/bin:/us
        r/X11R6/bin:/Produits/publics/x86_64.Linux.2.6.11/torque/1.2.0p5/bin:/u
        sr/lib64/jdk-1.4.2/bin:/Produits/tmp/torque/gen_user2/bin,
        PBS_O_MAIL=/var/spool/mail/gen_user2,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=l13-ci,PBS_O_WORKDIR=/Produits/tmp/torque/gen_user2,
        LESSKEY=/etc/.less,LC_PAPER=fr_FR,
        MANPATH=/Produits/publics/x86_64.Linux.2.6.11/torque/1.2.0p5/man:/Prod
        uits/publics/x86_64.Linux.2.6.11/man:/usr/share/man/fr:/usr/share/man:/
        usr/X11R6/man/fr:/usr/X11R6/man:/usr/lib64/jdk-1.4.2/man,
        PUBLICS=/Produits/publics/x86_64.Linux.2.6.11,LC_ADDRESS=fr_FR,
        HOSTNAME=l13-ci,LC_MONETARY=fr_FR,TERM=xterm,SHELL=/bin/bash,
        LC_SOURCED=1,HISTSIZE=1000,PBS_JOBNAME=shell.pbs,
        PBS_ENVIRONMENT=PBS_INTERACTIVE,LC_NUMERIC=fr_FR,QTDIR=/usr/lib/qt3/,
        OS=x86_64.Linux.2.6.11,PBS_O_WORKDIR=/Produits/tmp/torque/gen_user2,
        PBS_TASKNUM=1,USER=gen_user2,
        LD_LIBRARY_PATH=/Produits/publics/x86_64.Linux.2.6.11/lib,
        LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:
        cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01
        ;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01
        ;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;
        31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*
        .cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35
        :*.png=01;35:*.tif=01;35:,LC_TELEPHONE=fr_FR,
        FLASH_GTK_LIBRARY=libgtk-x11-2.0.so.0,
        PBS_O_HOME=/Produits/tmp/torque/gen_user2,
        DATACI=/data/nobody/gen_user2,PBS_MOMPORT=15003,
        SCREENDIR=/Produits/tmp/torque/gen_user2/tmp,
        MOZ_PLUGIN_PATH=/usr/lib/mozilla/plugins:/Produits/tmp/torque/gen_user
        2/.mozilla/plugins,XDG_CONFIG_DIRS=/var/lib/menu-xdg,
        PBS_O_QUEUE=q_gen_shell,NLSPATH=/usr/share/locale/%l/%N,
        MAIL=/var/spool/mail/gen_user2,PBS_O_LOGNAME=gen_user2,
        PATH=/Produits/publics/x86_64.Linux.2.6.11/bin:/bin:/usr/bin:/usr/X11R
        6/bin:/Produits/publics/x86_64.Linux.2.6.11/torque/1.2.0p5/bin:/usr/lib
        64/jdk-1.4.2/bin:/Produits/tmp/torque/gen_user2/bin,LC_MESSAGES=fr_FR,
        PBS_O_LANG=fr_FR,PBS_JOBCOOKIE=6183D7E18D158F2FE4AC6C106F413739,
        SECURE_LEVEL=2,LC_IDENTIFICATION=fr_FR,LC_COLLATE=fr_FR,
        INPUTRC=/etc/inputrc,PWD=/Produits/tmp/torque/gen_user2,
        JAVA_HOME=/usr/lib64/jdk-1.4.2,PBS_NODENUM=0,LANG=fr_FR,
        PYTHONSTARTUP=/etc/pythonrc.py,LC_MEASUREMENT=fr_FR,
        XFILESEARCHPATH=/Produits/publics/x86_64.Linux.2.6.11/%T/%N%S,
        PS1=[\\u@\\h \\W]\\$,PBS_O_SHELL=/bin/bash,PBS_JOBID=107.linux-ci,
        HISTCONTROL=ignoredups,JDK_HOME=/usr/lib64/jdk-1.4.2,SHLVL=1,
        HOME=/Produits/tmp/torque/gen_user2,LANGUAGE=fr_FR:fr,
        GCONF_TMPDIR=/tmp,PBS_O_HOST=l13-ci,LESS=-MM,
        G_FILENAME_ENCODING=@locale,LOGNAME=gen_user2,
        XDG_DATA_DIRS=/var/lib/menu-xdg:/usr/share,PBS_QUEUE=q_gen_shell,
        LC_CTYPE=fr_FR,TMPCI=/tmp,LESSOPEN=|/usr/bin/lesspipe.sh %s,
        PBS_O_MAIL=/var/spool/mail/gen_user2,
        INFOPATH=/Produits/publics/x86_64.Linux.2.6.11/info,DISPLAY=:0,
        LC_TIME=fr_FR,PBS_NODEFILE=/var/spool/torque/1.2.0p5/aux/107.linux-ci,
        XAUTHORITY=/Produits/tmp/torque/gen_user2/.Xauthority,LC_NAME=fr_FR,
        PBS_O_PATH=/usr/local/home/sosiatis/cluster_build/torque-srv/sbin:/usr
        /local/home/sosiatis/cluster_build/torque-client/sbin:/usr/local/home/s
        osiatis/cluster_build/torque-client/bin:/usr/local/home/sosiatis/cluste
        r_build/maui/sbin:/usr/local/home/sosiatis/cluster_build/maui/bin:/usr/
        local/home/sosiatis/cluster_build/torque-mom/sbin:/usr/local/bin:/bin:/
        usr/bin:/usr/X11R6/bin:/usr/games:/usr/lib64/jre-1.4.2/bin,
        _=/Produits/publics/x86_64.Linux.2.6.11/torque/1.2.0p5/bin/qsub,
        PBS_O_QUEUE=q_gen_batch
    etime = Thu Sep  1 13:22:38 2005


#
# ON THE MOM NODE:
#

# top
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28829 gen_user  25   0  802m 800m  364 R 99.8 20.2   4:44.42 okuseq

# momctl -d 3

Host: l14-ci/l14-ci   Server: linux-ci   Version: torque_1.2.0p5
PID:                    28635
HomeDirectory:          /var/spool/torque/1.2.0p5/mom_priv
MOM active:             6600 seconds
Last Msg From Server:   30 seconds (StatusJob)
Last Msg To Server:     10 seconds
Server Update Interval: 20 seconds
Init Msgs Received:     0 hellos/4 cluster-addrs
Init Msgs Sent:         4 hellos
LOGLEVEL:               4 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
TCP Timeout:            20 seconds
Prolog Alarm Time:      300 seconds
Alarm Time:             0 of 10 seconds
Trusted Client List:    132.149.11.217,132.149.11.213,132.149.11.207,132.149.11.202,132.149.11.220,132.149.11.219,132.149.11.218,127.0.0.1
Job[113.linux-ci]  State=RUNNING

diagnostics complete
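
One way to cross-check the MOM's accounting by hand is sketched below. It assumes that pbs_mom on Linux derives cput from /proc by summing the CPU time of the processes that belong to the job's session (28759 here, while top shows okuseq as PID 28829), which would need to be confirmed against the MOM source. The function name session_cpu_ticks and its optional second argument are made up for illustration; the field positions (6 = session, 14 = utime, 15 = stime) are those documented in proc(5).

```shell
#!/bin/bash
# Sum utime + stime (in clock ticks) over every process whose session ID is $1.
# $2 is an optional alternative proc root (hypothetical, defaults to /proc).
session_cpu_ticks() {
    local sid=$1 root=${2:-/proc} total=0 stat line rest
    local -a f
    for stat in "$root"/[0-9]*/stat; do
        [ -r "$stat" ] || continue
        line=$(cat "$stat" 2>/dev/null) || continue   # process may have exited
        # Drop "pid (comm) " up to the last ") ", since comm may contain spaces.
        rest=${line##*) }
        read -r -a f <<< "$rest"
        # After stripping, f[0] is field 3 (state), so field N is f[N-3].
        if [ "${f[3]}" = "$sid" ]; then
            total=$(( total + ${f[11]:-0} + ${f[12]:-0} ))
        fi
    done
    echo "$total"
}

# Example on the MOM node, using the session_id printed by qstat -f:
#   ticks=$(session_cpu_ticks 28759)
#   echo "$(( ticks / $(getconf CLK_TCK) )) s of CPU in session 28759"
```

If this returns roughly the TIME+ value shown by top, the processes are in the expected session and the kernel counters are sane; if it returns 0, okuseq is probably running outside session 28759.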

#
# End of job
#

- Standard output:

+----------------------------------------
| Job ID      : 113.linux-ci
| Username    : gen_user2
| Group       : nobody
| Job name    : test.pbs
| Batch Queue : q_gen_4h
| Resources   : cput=01:00:00,file=5gb,mem=1gb,neednodes=l14-ci,walltime=08:00:00
| Account     :
+----------------------------------------

version         : okuseq 2.0s
temps           : 1 x 1800 = 1800 s
memoire         : 1 x 800 = 800 Mo
MP_PROCS        : 1
OMP_NUM_THREADS : (null)
0 : Debut
Temps reel       : 1800.000000
Temps cpu user   : 1798.970000
Temps cpu system : 0.660000
0 : Fin par timer

+----------------------------------------
| RESOURCES SUMMARY
+----------------------------------------
| Job ID      : 113.linux-ci
| Job name    : test.pbs
| Batch Queue : q_gen_4h
| Resources :
|   Required  : cput=01:00:00,file=5gb,mem=1gb,neednodes=l14-ci,walltime=08:00:00
|   Used      : cput=00:00:00,mem=822972kb,vmem=839112kb,walltime=00:30:02
+----------------------------------------

- End-of-job mail:

From torque at linux-ci.cst.cnes.fr  Thu Sep  1 13:52:47 2005
X-Original-To: gen_user2 at l13-ci.cst.cnes.fr
Delivered-To: gen_user2 at l13-ci.cst.cnes.fr
To: gen_user2 at l13-ci.cst.cnes.fr
Subject: PBS JOB 113.linux-ci
Precedence: bulk
Date: Thu,  1 Sep 2005 13:52:41 +0200 (CEST)
From: torque at linux-ci.cst.cnes.fr (root)

PBS Job Id: 113.linux-ci
Job Name:   test.pbs
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=822972kb
resources_used.vmem=839112kb
resources_used.walltime=00:30:02

##############

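As you can see, the job runs correctly and everything looks fine except the CPU time accounting: okuseq itself reports about 1800 s of CPU time, yet resources_used.cput stays at 00:00:00.
I see nothing unusual in the MOM or server logs, so I don't know where to look next.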
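
To put numbers on the discrepancy, the short sketch below converts the HH:MM:SS values from the end-of-job mail into seconds and sets them beside the user and system CPU times printed by okuseq. The helper name hms_to_seconds is made up for illustration; the figures are copied from the output above.

```shell
#!/bin/bash
# Convert an HH:MM:SS string (as printed in resources_used.*) to seconds.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

reported=$(hms_to_seconds "00:00:00")   # resources_used.cput
walltime=$(hms_to_seconds "00:30:02")   # resources_used.walltime
# User + system CPU seconds printed by okuseq, truncated to whole seconds.
expected=$(awk 'BEGIN { printf "%d", 1798.97 + 0.66 }')

echo "walltime=${walltime}s expected_cput~${expected}s reported_cput=${reported}s"
```

The job accumulated about 1799 s of CPU over 1802 s of wall time, so a correct cput would be close to 00:29:59 rather than 00:00:00.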

Any help would be appreciated!

Regards.


> Hi all,
> 
> We currently use Torque 1.2.0p5 / Maui 3.2.6p14(snap) on our Linux cluster. All nodes have the same Opteron-based architecture. The Linux distribution used on all nodes is Mandriva LE 2005.
> 
> The problem is that the CPU time used by batch jobs is not reported correctly (it is always zero):
> cput=00:00:00,mem=4256kb,vmem=24428kb,walltime=00:01:01
> 
> But it is not only a reporting problem: my cputime-based policies do not work either.
> 
> For example, I set "RESOURCELIMITPOLICY  PROC:ALWAYS:CANCEL" in my Maui configuration,
> but a job asking for 10 CPU minutes is able to use several CPU hours without being cancelled.
> 
> Has anyone experienced this problem with resources_used.cput?
> 
> Regards.
> 
> -- 
> Mathieu OUDART
> 



