[torquedev] SMP system issues with pbs_mom, in mom_mach.c (15007)

Andrew Keen keenandr at msu.edu
Thu May 29 10:56:36 MDT 2008


Hi,

I'm running into the error described at: 
http://www.clusterresources.com/pipermail/torqueusers/2007-August/006046.html 
on our 128-CPU SMP system.

But we're not running CPUSets, so the patch provided there won't work for
us. Here's the gprof output (time was not reported correctly):

gprof -b /usr/local/sbin/pbs_mom gmon.out
Flat profile:

Each sample counts as 0.000976562 seconds.
no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00  3590616     0.00     0.00  injob
  0.00      0.00     0.00    97512     0.00     0.00  str_nc_cmp
  0.00      0.00     0.00    31997     0.00     0.00  get_proc_stat
  0.00      0.00     0.00    12540     0.00     0.00  clear_attr
  0.00      0.00     0.00     5286     0.00     0.00  find_resc_def
  0.00      0.00     0.00     4890     0.00     0.00  find_attr
  0.00      0.00     0.00     4422     0.00     0.00  find_resc_entry
-snip-


Call graph


granularity: each sample hit covers 4 byte(s) no time propagated

index % time    self  children    called           name
                0.00    0.00 1107369/3590616       mem_sum [20]
                0.00    0.00 1107369/3590616       resi_sum [22]
                0.00    0.00 1375878/3590616       cput_sum [17]
[1]      0.0    0.00    0.00 3590616               injob [1]

Migrating the mom to 2.3 has reduced the impact on the server, but the 
mom still spends a lot of time crawling the /proc tree.
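
To illustrate the pattern in the call graph: injob is reached from three
separate summing routines (cput_sum, mem_sum, resi_sum), so every process
gets checked for job membership up to three times per poll. One possible
way to cut that down, sketched below, is to do the membership check once
per process and accumulate all three totals in the same pass. This is only
a rough sketch under assumptions: the structs, field names, and the
in_job_demo() helper are hypothetical stand-ins and are not the actual
code or data structures from mom_mach.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified process record; NOT the layout used in mom_mach.c. */
struct proc_sample {
    int session_id;            /* session the process belongs to */
    unsigned long cpu_ticks;   /* CPU time consumed */
    unsigned long vmem_kb;     /* virtual memory size */
    unsigned long rss_kb;      /* resident set size */
};

/* Hypothetical per-job totals gathered in a single pass. */
struct job_usage {
    unsigned long cpu_ticks;
    unsigned long vmem_kb;
    unsigned long rss_kb;
};

/* Counts membership checks so the cost of a poll can be observed. */
unsigned long membership_checks = 0;

/* Stand-in for a job membership test (illustrative, not TORQUE's injob):
 * returns 1 if session_id is one of the job's sessions, else 0. */
int in_job_demo(int session_id, const int *job_sessions, size_t n_sessions)
{
    size_t i;

    membership_checks++;
    for (i = 0; i < n_sessions; i++) {
        if (job_sessions[i] == session_id)
            return 1;
    }
    return 0;
}

/* Walk the process list once: one membership check per process,
 * with all three totals accumulated in the same iteration. */
struct job_usage sum_job_usage(const struct proc_sample *procs, size_t n_procs,
                               const int *job_sessions, size_t n_sessions)
{
    struct job_usage total = {0, 0, 0};
    size_t i;

    assert(procs != NULL || n_procs == 0);

    for (i = 0; i < n_procs; i++) {
        if (!in_job_demo(procs[i].session_id, job_sessions, n_sessions))
            continue;
        total.cpu_ticks += procs[i].cpu_ticks;
        total.vmem_kb   += procs[i].vmem_kb;
        total.rss_kb    += procs[i].rss_kb;
    }
    return total;
}
```

With N processes this performs N membership checks per job per poll
instead of roughly 3N. It does not reduce the cost of reading /proc
itself, which would need to be addressed separately.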

-Andy

