Bugzilla – Bug 144
Possible memory leak in pbs_server
Last modified: 2013-02-13 17:24:05 MST
In our cluster, pbs_server starts out using a small amount of memory:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18506 root 15 0 151m 124m 740 S 35.6 3.2 0:05.24 pbs_server
but after a few minutes, it uses more and more memory:
18506 root 15 0 535m 503m 1220 S 25.9 12.7 4:23.37 pbs_server
18506 root 15 0 1021m 986m 1224 S 52.8 24.9 27:34.41 pbs_server
and after a few hours, it uses ALL the available memory and the host starts swapping;
jobs cannot start because the master cannot fork:
pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed
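To put numbers on the growth in the top output above, here is a minimal Python sketch that extracts the RES column from the three samples and computes the increase between them. It assumes the usual procps top conventions (RES is the 6th field, a bare number is KiB, and the suffixes k/m/g/t scale by 1024); the helper names `top_size_to_mib` and `resident_mib` are made up for illustration and are not part of Torque.

```python
def top_size_to_mib(field: str) -> float:
    """Convert a top(1) memory field such as '124m' or '740' to MiB.

    Assumption: bare numbers are KiB and suffixes k/m/g/t scale by 1024.
    """
    units = {"k": 1 / 1024, "m": 1.0, "g": 1024.0, "t": 1024.0 * 1024.0}
    suffix = field[-1].lower()
    if suffix in units:
        return float(field[:-1]) * units[suffix]
    return float(field) / 1024  # no suffix: value is in KiB


def resident_mib(top_line: str) -> float:
    """Return the RES column (6th whitespace-separated field) in MiB."""
    return top_size_to_mib(top_line.split()[5])


# The three samples quoted in this report.
SAMPLES = [
    "18506 root 15 0 151m 124m 740 S 35.6 3.2 0:05.24 pbs_server",
    "18506 root 15 0 535m 503m 1220 S 25.9 12.7 4:23.37 pbs_server",
    "18506 root 15 0 1021m 986m 1224 S 52.8 24.9 27:34.41 pbs_server",
]

res = [resident_mib(line) for line in SAMPLES]
# Increase in resident memory between consecutive samples, in MiB.
growth = [later - earlier for earlier, later in zip(res, res[1:])]

if __name__ == "__main__":
    print("RES (MiB):", res)
    print("growth (MiB):", growth)
```

Note that TIME+ is accumulated CPU time, not wall-clock time, so these samples alone do not give a leak rate per hour; for that, each sample would need a timestamp.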
set server scheduling = False
set server acl_host_enable = False
set server acl_hosts = pbs03.pic.es
set server managers = XXXXXXXX
set server managers += firstname.lastname@example.org
set server managers += email@example.com
set server operators = firstname.lastname@example.org
set server operators += email@example.com
set server default_queue = glong_sl5
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server default_node = slc5_x64
set server node_pack = False
set server job_stat_rate = 300
set server mail_domain = never
set server next_job_number = 18080874
Created an attachment (id=78):
maps & smaps of pbs_server process
Any progress on this?
We have the same problem with Torque 2.5.8 and Moab 6.1.
As a workaround we use a cron script that restarts the server every hour.
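For anyone who wants something less blunt than a fixed hourly restart, below is a rough Python sketch of a variation: restart only once the resident memory of pbs_server passes a limit. This is not the cron script mentioned above, just an illustration. It assumes a Linux host where /proc/<pid>/status contains a `VmRSS:` line in kB; the function names, the limit, and the restart command are all placeholders that the caller has to supply for their own site.

```python
import re
import subprocess


def rss_kib(status_text: str) -> int:
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def should_restart(status_text: str, limit_mib: int) -> bool:
    """True when resident memory exceeds limit_mib."""
    return rss_kib(status_text) > limit_mib * 1024


def restart_if_needed(pid: int, limit_mib: int, restart_cmd: list[str]) -> bool:
    """Run restart_cmd if the process is over the limit.

    restart_cmd is whatever your site uses to restart pbs_server
    (hypothetical here; it is not defined by this sketch).
    """
    with open(f"/proc/{pid}/status") as fh:
        text = fh.read()
    if should_restart(text, limit_mib):
        subprocess.run(restart_cmd, check=True)
        return True
    return False
```

Run from cron every few minutes, this would only bounce the server when it has actually grown, at the cost of having to pick a sensible limit for the host.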
Arnau, what scheduler software are you using? What is the size of your cluster
(nodes/cores/average number of jobs)?
We currently have 1k nodes and around 11k cores, with 5k jobs on average; core
utilization is around 95%.
I am not able to debug the issue myself in production with valgrind because it
slows things down too much. The bigger the cluster, the faster the problem occurs.
We use MAUI on 3200 job slots (315 nodes),
and our cluster starts having problems when it has about 6.5k to 7k jobs.
Do you have gpus in your cluster?
After having applied patch
memory leaks are almost gone.
Something is still leaking, however, and we haven't figured out what it is.
With 25k jobs per day it takes 4 weeks to cross 1.5 GB, so I think it is a good