[torquedev] Torque 2.5.8 - memory leaks

Lukasz Flis l.flis at cyf-kr.edu.pl
Wed Sep 28 16:29:15 MDT 2011


Hello Ken,

On Thursday 29 September 2011 00:00:32 Ken Nielson wrote:
> ----- Original Message -----
> 
> > From: "Lukasz Flis" <l.flis at cyf-kr.edu.pl>
> > To: "Torque Developers mailing list" <torquedev at supercluster.org>
> > Sent: Tuesday, September 27, 2011 5:15:34 PM
> > Subject: [torquedev] Torque 2.5.8 - memory leaks
> >
> > Hi,
> >
> > We are running a medium-sized cluster with Torque and Moab.
> > Average number of jobs: usually around 4k
> > Number of nodes: 950
> > Number of cores (at the moment): 10k
> >
> > We recently migrated from 2.4.12 to 2.5.8. Unfortunately, we are
> > seeing memory problems with Torque. I'm attaching a daily graph from
> > our Ganglia monitoring system. On average it takes 3 hours for Torque
> > to consume 8 GB of memory. After that the pbs_server daemon has to be
> > restarted, because it can no longer fork for lack of available memory,
> > and then the OOM killer kicks in.
> >
> >   pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed
> >
> > Due to the scale of the system I am unable to run pbs_server under
> > valgrind to find the source of the leak. I did some testing on our
> > test cluster, but I'm not sure how accurate valgrind's results are.
> >
> > A lot of the messages point to the decode_str function:
> > ==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
> > ==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
> > ==31895==    by 0x452BD8: decode_str (attr_fn_str.c:144)
> > ==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
> > ==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
> > ==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
> > ==31895==    by 0x4213A2: main (pbsd_main.c:1465)
> >
> > as well as to decode_arst_direct:
> >
> > ==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
> > ==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
> > ==31895==    by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
> > ==31895==    by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
> > ==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
> > ==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
> > ==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
> > ==31895==    by 0x4213A2: main (pbsd_main.c:1465)
> >
> >
> > I am going to dig into the sources a bit and check whether the memory
> > allocated by the above functions is freed properly.
> > However, any suggestions and hints are welcome, as I might not be able
> > to fix it all myself.
> >
> > Thank you for your attention
> > --
> > Lukasz Flis
> 
> What scheduler are you using?

We are using Moab 6.1.1.
The server's scheduling attribute is set to True.

We had similar problems with 2.4.12 and Maui, but the amount of
consumed/leaked memory was smaller because that Torque was a 32-bit build.
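
For reference, here is the kind of pattern I will be looking for around the
malloc calls that valgrind reports in decode_str and decode_arst_direct. I
have not confirmed that this is what the Torque sources actually do. This is
only a minimal sketch with hypothetical names (str_attr, decode_str_leaky,
decode_str_fixed, free_str_attr), assuming a decode routine that may be called
on an attribute that already holds a value:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string-valued attribute.
 * This is NOT the real Torque attribute structure. */
struct str_attr {
    char *value;   /* heap-allocated string, or NULL when unset */
};

/* Leaky variant: stores a fresh copy and drops the old pointer.
 * If attr->value was already set, the old buffer becomes unreachable,
 * which valgrind would report as "definitely lost". */
int decode_str_leaky(struct str_attr *attr, const char *val)
{
    size_t len = strlen(val) + 1;
    char *copy = malloc(len);

    if (copy == NULL)
        return -1;
    memcpy(copy, val, len);
    attr->value = copy;          /* previous buffer is never freed */
    return 0;
}

/* Fixed variant: release the previous value before replacing it. */
int decode_str_fixed(struct str_attr *attr, const char *val)
{
    size_t len = strlen(val) + 1;
    char *copy = malloc(len);

    if (copy == NULL)
        return -1;               /* old value is left untouched on failure */
    memcpy(copy, val, len);
    free(attr->value);           /* free(NULL) is a no-op */
    attr->value = copy;
    return 0;
}

/* Release the attribute's storage when it is no longer needed. */
void free_str_attr(struct str_attr *attr)
{
    free(attr->value);
    attr->value = NULL;
}
```

If the decode functions behave like the leaky variant, every repeated decode
of the same attribute would lose one buffer. Whether that can account for the
growth we see in production is something I still have to verify.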

> Ken Nielson


