[torqueusers] stability observations

Alexander Saydakov saydakov at yahoo-inc.com
Thu Apr 6 16:34:14 MDT 2006


1. You mean some memory leak detector? I have not tried that.

2. Too bad. Maybe I need to set up Maui. Especially considering #1.

3. Below is the latest crash, a few hours after I deleted some nodes. I think
there was no activity right after I deleted them, but it crashed as soon as
new jobs started to pile up.

Core was generated by `pbs_server'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libkvm.so.2...done.
Reading symbols from /usr/lib/libc.so.4...done.
Reading symbols from /usr/libexec/ld-elf.so.1...done.
#0  0x1005df9 in bad_node_warning (addr=1122282515) at node_func.c:226
226         if (pbsndlist[i]->nd_addrs[0] == addr)
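
The trace shows only the faulting line, not the cause. One possible reading, which is an assumption and not something the core file confirms, is that after a node deletion the loop reaches a slot in pbsndlist that is NULL or whose nd_addrs is NULL. A minimal sketch of a guarded lookup under that assumption follows; struct node_sketch and find_node_by_addr are hypothetical stand-ins for illustration, not the actual TORQUE types or functions.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the server's node record.
 * Only the field that appears in the faulting line is modeled. */
struct node_sketch {
    unsigned long *nd_addrs;   /* address list; may be NULL in this sketch */
};

/* Return the index of the first node whose first address equals addr,
 * or -1 if there is no such node. Empty slots and nodes without an
 * address list are skipped instead of being dereferenced. */
int find_node_by_addr(struct node_sketch **list, int count, unsigned long addr)
{
    int i;

    if (list == NULL)
        return -1;

    for (i = 0; i < count; i++) {
        /* Assumption: a deleted node may leave a NULL slot behind. */
        if (list[i] == NULL || list[i]->nd_addrs == NULL)
            continue;

        if (list[i]->nd_addrs[0] == addr)
            return i;
    }

    return -1;
}
```

A NULL check of this kind would not help if the slot holds a dangling pointer to freed memory; in that case the deletion path itself would need to clear or compact the list. Printing pbsndlist[i] and i from frame 0 in gdb should show which case applies.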



-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
Sent: Thursday, April 06, 2006 3:03 PM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] stability observations

On Thu, Apr 06, 2006 at 12:57:47PM -0700, Alexander Saydakov alleged:
> After a few months of running 2.0.0p7 and 2.0.0p8 on FreeBSD 4.10 I
> observed the following:
> 
>  
> 
> 1. pbs_sched has a memory leak. Its footprint keeps growing every day,
> so after a fresh start it reaches 300M in a few days.

Can you capture this in valgrind?  (or whatever freebsd has)


> 2. pbs_sched has some bug in the algorithm. Quite often it picks up
> some random jobs from lower priority queues despite there being a lot of
> jobs in higher priority queues.

I don't know how much support you are going to get for this.  No one is
maintaining pbs_sched.


> 3. pbs_server is unstable when some configuration changes are made.
> Strangely, it can crash a few minutes after a change. Not all
> changes are bad. Adding nodes and queues, or adjusting their parameters, is
> fine. After deleting nodes (with patch! With no patch it died
> immediately), for instance, it died within a few hours. If you don't touch
> it, it runs forever.

Can you capture this in gdb?

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California


