[torqueusers] 4GB resources_used.mem limit
Bernd Schubert
bernd-schubert at gmx.de
Thu Jun 30 16:54:31 MDT 2005
Dear Garrick,
many many thanks for your help!
> I'm not able to test this. But the first thing you need to do is figure
> out if pbs_mom is reporting the wrong info, or if pbs_server is breaking
> it.
This was my first thought, too. So I looked into the logfiles, but there was
nothing about this at all.
>
> You can query this info directly from pbs_mom using momctl or a small util
> I wrote awhile ago called dumpmom
> (http://www-rcf.usc.edu/~garrick/dumpmom.c)
>
> To use momctl, first get the session list, then get the memory usage of
> that session. Here's an example with a node having 2 sessions, and 1 of
> them is using 100MB.
>
> $ momctl -q sessions -h hpc0961
> hpc0961: sessions = 'sessions=30631 30651'
> $ momctl -q 'mem[session=30631]' -h hpc0961
> hpc0961: mem[session=30631] = 'mem[session=30631]=120856kb'
>
> dumpmom is easier for this particular purpose, just do 'dumpmom hpc0961'
> and it will print out lots of similar information.
Thanks, both momctl and your dumpmom are working fine.
>
> If you can verify that pbs_mom is sending the correct info, then we can
> look into pbs_server.
Well, the output of momctl and dumpmom shows that pbs_mom itself is already
reporting the wrong value:
mem[session=24986]=2115728kb
This value should be 6GB.
Well, I think I just found the reason for the problem. pbs_mom reads the
memory usage per process in the mem_sum() function of mom_mach.c and uses the
vsize member of the proc_stat_t structure there. This vsize variable is
declared as unsigned, and a quick test just showed me that unsigned is only a
32-bit type on x86_64. I will change it to 'unsigned long' (which is 64-bit
there) tomorrow; I'm just too tired now.
Thanks again for your help,
Bernd
--
Bernd Schubert
PCI / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: bernd.schubert at pci.uni-heidelberg.de