[torquedev] [Fwd: TORQUE Logging Messages]

David Singleton David.Singleton at anu.edu.au
Mon Aug 20 17:05:45 MDT 2007


Josh Butikofer wrote:
> Everyone,
> 
> I'm working with a TORQUE logging issue. We are seeing the following log message three
> times a second on some compute nodes running pbs_mom.
> 
> Aug 16 04:02:18 hplcnla004 pbs_mom: Success (0) in sessions, 2460:
> get_proc_stat
> Aug 16 04:02:18 hplcnla004 pbs_mom: Success (0) in sessions, 2460:
> get_proc_stat
> Aug 16 04:02:18 hplcnla004 pbs_mom: Success (0) in nusers, 2460:
> get_proc_stat

sessions() and nusers() both run through all processes in /proc so
any generic problem with reading /proc/$pid/stat would return a
message for all pids.  So the above looks more like an issue with
the specific /proc entry for pid 2460 than any more generic
problem. It would have been useful to see cat /proc/2460/stat
in this case.
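
For anyone wanting to check this on an affected node, here is a rough
sketch in Python of the kind of walk over /proc described above. It is
not the TORQUE get_proc_stat code; the names parse_stat_line and
scan_proc are made up for illustration, and it only checks that each
stat file can be read and split, assuming the usual "pid (comm) state ..."
layout.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, remaining fields)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    rest = line[close_idx + 1:].split()
    return pid, comm, rest


def scan_proc(proc_root="/proc"):
    """Return (pid, error text) for every numeric entry whose stat file
    could not be read or parsed."""
    failures = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        path = os.path.join(proc_root, name, "stat")
        try:
            with open(path) as fh:
                parse_stat_line(fh.read())
        except (OSError, ValueError) as exc:
            # OSError: the process exited or the file is unreadable.
            # ValueError: the line did not have the expected shape.
            failures.append((int(name), str(exc)))
    return failures


if __name__ == "__main__":
    for pid, err in sorted(scan_proc()):
        print(pid, err)
```

If only one pid ever shows up in the output, that points at the specific
/proc entry rather than at the stat format in general.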

I'm not sure why you are particularly interested in the pbs_mom /proc
entries below.  Was pid 2460 in the example above the pbs_mom
process?

We have the one version of MOM (not TORQUE, but reading /proc
nonetheless) running on both 2.6.9-42 and 2.6.9-55 quite happily.
I don't think there have been any changes in /proc/pid/stat
entries.
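
One way to test that on the two kernels would be to compare the layout
(not the values) of a stat line from each. The following is only a
sketch with made-up helper names (stat_fields, layout_differences),
assuming the usual "pid (comm) state ..." layout; it says nothing about
how MOM or TORQUE actually parse the file.

```python
def stat_fields(line):
    """Return the fields of a /proc/<pid>/stat line, keeping comm as one field."""
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition("(")
    return [pid.strip(), comm] + tail.split()


def is_int(text):
    """True if text parses as a (possibly negative) decimal integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def layout_differences(line_a, line_b):
    """Describe differences in field count or field type between two stat lines."""
    a, b = stat_fields(line_a), stat_fields(line_b)
    diffs = []
    if len(a) != len(b):
        diffs.append("field count %d vs %d" % (len(a), len(b)))
    for pos, (x, y) in enumerate(zip(a, b), start=1):
        if pos in (2, 3):
            continue  # comm and state are not numeric
        if is_int(x) != is_int(y):
            diffs.append("field %d: %r vs %r" % (pos, x, y))
    return diffs
```

An empty list from layout_differences(line_from_42, line_from_55) would
mean the two kernels emit the same number and kind of fields, so any
remaining difference would be in the values themselves.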

David

> 
> These nodes recently had a new driver/kernel installed on them. Looking at old e-mails from the
> mailing list, it appears that this is due to a bad reading of the /proc/$pid/stat file. Below are
> the "good" and "bad" stat files and their respective kernels.
> 
> Good one -
> 

I think your cut-and-paste of this line lost some spaces.

> [root at hplcnla025 3273]# cat stat
> 3273 (pbs_mom) S 1 3273 3273 0 -1 4194624 8308680 200730468 5 251 7088
> 12989 22267891 239071 16 0 1 05440 10121216 347 18446744073709551615
                                0 5540 ?
> 4194304 4428684 548682071344 18446744073709551615 1828983421330 0 4096
                                                     182898342133 0 ?
> 25258499 0 0 0 17 0 0 0
> [root at hplcnla025 3273]# uname -a
> Linux hplcnla025 2.6.9-42.0.10.EL_lustre-1.6.0.1smp #1 SMP Thu May 3
> 20:37:18 MDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> Bad one -
> 
> [jwobrya]@hplcnla002:/proc/3348
> $ cat stat
> 3348 (pbs_mom) S 1 3348 3348 0 -1 4194624 78635 0 3 0 5 9 0 0 16 0 1 0 5488
> 9228288 282 18446744073709551615 4194304 4428684 548682071344
> 18446744073709551615 182898344229 0 0 4096 25258499 18446744073709551615 0
> 0 17 0 0 0
> [jwobrya]@hplcnla002:/proc/3348
> $ uname -a
> Linux hplcnla002 2.6.9-55.EL_lustre-1.6.1smp #1 SMP Fri Aug 10 09:16:20 MDT
> 2007 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> My question is has this ever been resolved, or is it something we have to tweak each time the kernel
> changes the stat file's format? If not, what are next steps to eliminating/decreasing the number of
> node entries? Note that I'm not a TORQUE developer even though I do work for CRI. :)
> 
> Thanks,
> 


