Bug 144 - Possible memory leak in pbs_server
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 2.5.x
Platform: PC Linux
Importance: P5 major
Assigned To: David Beer
 
Reported: 2011-07-07 04:48 MDT by Arnau
Modified: 2013-02-13 17:24 MST (History)

Attachments
maps & smaps of pbs_server process (137 bytes, text/tgz)
2011-07-07 09:18 MDT, Arnau


Description Arnau 2011-07-07 04:48:03 MDT
In our cluster, pbs_server starts out using little memory:

PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND                 
18506 root      15   0  151m 124m  740 S 35.6  3.2   0:05.24 pbs_server 

but after a few minutes it uses more and more memory:

18506 root      15   0  535m 503m 1220 S 25.9 12.7   4:23.37 pbs_server 
[...]
18506 root      15   0 1021m 986m 1224 S 52.8 24.9  27:34.41 pbs_server  

and after a few hours it uses ALL the available memory and the host starts
swapping. Jobs cannot start because the master cannot fork:

pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed 
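
To put numbers on the growth instead of reading them off top, the resident set size can be sampled from /proc. This is only a minimal sketch, assuming a Linux host; the helper names (parse_vm_rss_kib, read_rss_kib, growth_kib_per_hour, sample_rss) are made up for illustration and are not part of TORQUE:

```python
import re
import time


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found (kernel thread or exited process?)")
    return int(match.group(1))


def read_rss_kib(pid):
    """Read the current resident set size of a process, in kB."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_rss_kib(handle.read())


def growth_kib_per_hour(samples):
    """Average growth between the first and last (unix_time, rss_kib) sample."""
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (rss1 - rss0) * 3600.0 / (t1 - t0)


def sample_rss(pid, count, interval_s):
    """Collect `count` samples of (unix_time, rss_kib), `interval_s` seconds apart."""
    samples = []
    for i in range(count):
        samples.append((time.time(), read_rss_kib(pid)))
        if i + 1 < count:
            time.sleep(interval_s)
    return samples
```

For the process shown above, something like `growth_kib_per_hour(sample_rss(18506, 12, 300))` would give an average growth rate over roughly an hour.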

set server scheduling = False
set server acl_host_enable = False
set server acl_hosts = pbs03.pic.es
set server managers = XXXXXXXX
set server managers += monami@pbs03.pic.es
set server managers += root@pbs03.pic.es
set server operators = monami@pbs03.pic.es
set server operators += root@pbs03.pic.es
set server default_queue = glong_sl5
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server default_node = slc5_x64
set server node_pack = False
set server job_stat_rate = 300
set server mail_domain = never
set server next_job_number = 18080874


torque-server-2.5.6-0.cri.snap.201104041023.x86_64
Comment 1 Arnau 2011-07-07 09:18:32 MDT
Created an attachment (id=78) [details]
maps & smaps of pbs_server process
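
For anyone reading an smaps dump like the attached one, summing the Rss field per mapping shows quickly whether the growth sits in the heap, in anonymous mappings, or in a library. A rough sketch, assuming the usual Linux /proc/<pid>/smaps layout (a mapping header line followed by "Key: value kB" lines); the function names are hypothetical:

```python
import re
from collections import defaultdict

# A mapping header looks like:
#   00400000-004b2000 r-xp 00000000 08:01 1234    /usr/sbin/pbs_server
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\d+\s*(.*)$")
RSS = re.compile(r"^Rss:\s+(\d+)\s+kB")


def rss_by_mapping(smaps_text):
    """Sum the Rss field (kB) per mapping name; unnamed mappings become '[anon]'."""
    totals = defaultdict(int)
    current = None
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1).strip() or "[anon]"
            continue
        rss = RSS.match(line)
        if rss and current is not None:
            totals[current] += int(rss.group(1))
    return dict(totals)


def top_mappings(smaps_text, limit=5):
    """Return the `limit` largest (name, rss_kb) pairs, biggest first."""
    totals = rss_by_mapping(smaps_text)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]
```

Comparing the output of `top_mappings` for two dumps taken some minutes apart should show which mapping is actually growing.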
Comment 2 Lukasz Flis 2011-10-21 14:21:28 MDT
Hi,

Any progress on this?

We have the same problem with Torque 2.5.8 and Moab 6.1.
As a workaround we use a cron script that restarts the server every hour.

Arnau, what scheduler software are you using? What is the size of your cluster
(nodes/cores/average number of jobs)?

We currently have 1k nodes and around 11k cores, with 5k jobs on average and
core utilization around 95%.

I am not able to debug the issue myself in production with valgrind because it
slows things down too much. The bigger the cluster, the faster the problem
occurs.

Best Regards
--
Lukasz Flis
Comment 3 Arnau 2011-10-24 07:26:59 MDT
Hi,

We use MAUI on a cluster with 3200 job slots (315 nodes),
and the cluster starts having problems when it has about 6.5k to 7k jobs
(idle + running).

Cheers,
Arnau
Comment 4 Lukasz Flis 2013-02-13 17:24:05 MST
Arnau,

Do you have GPUs in your cluster?
After applying the patch
https://github.com/adaptivecomputing/torque/commit/ab3c3dd9a4422155c31f29ad945a30f425c45ce8
the memory leaks are almost gone.

Something is still leaking, however, and we haven't figured out what it is.
With 25k jobs per day it takes 4 weeks to cross 1.5 GB, so I think it is a good
result ;)
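
As a back-of-the-envelope check, those figures can be turned into a per-job leak estimate. The sketch below assumes 4 weeks = 28 days and 1 GB = 2^30 bytes, and attributes the whole 1.5 GB to the leak (ignoring the baseline footprint of the server), so the result is an upper bound rather than a measurement:

```python
GIB = 1024 ** 3  # assumption: "GB" in the comment means 2**30 bytes


def leak_bytes_per_job(total_bytes, jobs_per_day, days):
    """Upper bound on leaked bytes per job if all of total_bytes were leaked."""
    jobs = jobs_per_day * days
    if jobs <= 0:
        raise ValueError("need a positive number of jobs")
    return total_bytes / jobs


# 25k jobs per day for 4 weeks, reaching 1.5 GB
estimate = leak_bytes_per_job(1.5 * GIB, jobs_per_day=25_000, days=28)
print(f"upper bound: {estimate:.0f} bytes per job")
```

That works out to roughly 2.3 kB per job at most, which is consistent with a small per-job structure not being freed rather than a large buffer.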