[torquedev] [Fwd: TORQUE Logging Messages]

Josh Butikofer josh at clusterresources.com
Mon Aug 20 13:41:29 MDT 2007


Everyone,

I'm working with a TORQUE logging issue. We are seeing the following log message three
times a second on some compute nodes running pbs_mom.

Aug 16 04:02:18 hplcnla004 pbs_mom: Success (0) in sessions, 2460: get_proc_stat
Aug 16 04:02:18 hplcnla004 pbs_mom: Success (0) in sessions, 2460: get_proc_stat
Aug 16 04:02:18 hplcnla004 pbs_mom: Success (0) in nusers, 2460: get_proc_stat

These nodes recently had a new driver/kernel installed on them. Looking at old e-mails from the
mailing list, it appears that this is due to a bad reading of the /proc/$pid/stat file. Below are
the "good" and "bad" stat files and their respective kernels.

Good one -

[root@hplcnla025 3273]# cat stat
3273 (pbs_mom) S 1 3273 3273 0 -1 4194624 8308680 200730468 5 251 7088
12989 22267891 239071 16 0 1 05440 10121216 347 18446744073709551615
4194304 4428684 548682071344 18446744073709551615 1828983421330 0 4096
25258499 0 0 0 17 0 0 0
[root@hplcnla025 3273]# uname -a
Linux hplcnla025 2.6.9-42.0.10.EL_lustre-1.6.0.1smp #1 SMP Thu May 3
20:37:18 MDT 2007 x86_64 x86_64 x86_64 GNU/Linux


Bad one -

[jwobrya]@hplcnla002:/proc/3348
$ cat stat
3348 (pbs_mom) S 1 3348 3348 0 -1 4194624 78635 0 3 0 5 9 0 0 16 0 1 0 5488
9228288 282 18446744073709551615 4194304 4428684 548682071344
18446744073709551615 182898344229 0 0 4096 25258499 18446744073709551615 0
0 17 0 0 0
[jwobrya]@hplcnla002:/proc/3348
$ uname -a
Linux hplcnla002 2.6.9-55.EL_lustre-1.6.1smp #1 SMP Fri Aug 10 09:16:20 MDT
2007 x86_64 x86_64 x86_64 GNU/Linux
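
For anyone who wants to compare stat lines field by field, here is a minimal
sketch in Python. It is not TORQUE's get_proc_stat code; parse_stat_line is a
hypothetical helper written only for this comparison, and the field numbering
in the comments follows the proc(5) man page. The sample is the "bad" line
quoted above, rejoined onto one line.

```python
def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into a list of string fields.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so we split around the first '(' and
    the LAST ')' instead of splitting on whitespace alone.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = line[:open_paren].strip()
    comm = line[open_paren + 1:close_paren]
    rest = line[close_paren + 1:].split()
    return [pid, comm] + rest


# The "bad" sample from hplcnla002, rejoined onto a single line.
BAD_SAMPLE = (
    "3348 (pbs_mom) S 1 3348 3348 0 -1 4194624 78635 0 3 0 5 9 0 0 16 0 1 0 5488 "
    "9228288 282 18446744073709551615 4194304 4428684 548682071344 "
    "18446744073709551615 182898344229 0 0 4096 25258499 18446744073709551615 0 "
    "0 17 0 0 0"
)

fields = parse_stat_line(BAD_SAMPLE)
print("field count:", len(fields))
# Lists are 0-indexed, so proc(5) field 35 (wchan) is fields[34].
print("field 35:", fields[34])
```

Running the same helper over a line from each kernel and printing the fields
side by side shows where the two differ. It does not say which field
pbs_mom trips over; that still needs a look at the parsing code itself.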


My question is: has this ever been resolved, or is it something we have to tweak each time the
kernel changes the stat file's format? If it hasn't been resolved, what are the next steps toward
eliminating or reducing the number of these log entries on the nodes? Note that I'm not a TORQUE
developer even though I do work for CRI. :)

Thanks,

-- 
Joshua Butikofer
Cluster Resources, Inc.

josh at clusterresources.com
Voice: (801) 717-3707
Fax:   (801) 717-3738
--------------------------

