Bugzilla – Bug 144
Possible memory leak in pbs_server
Last modified: 2013-02-13 17:24:05 MST
In our cluster, pbs_server starts out using a small amount of memory:

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    18506 root      15   0  151m 124m  740 S 35.6  3.2   0:05.24 pbs_server

but after a few minutes it uses more and more memory:

    18506 root      15   0  535m 503m 1220 S 25.9 12.7   4:23.37 pbs_server
    [...]
    18506 root      15   0 1021m 986m 1224 S 52.8 24.9  27:34.41 pbs_server

and after a few hours it uses ALL the available memory and the host starts swapping. Jobs cannot start because the master cannot fork:

    pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed

    set server scheduling = False
    set server acl_host_enable = False
    set server acl_hosts = pbs03.pic.es
    set server managers = XXXXXXXX
    set server managers += monami@pbs03.pic.es
    set server managers += root@pbs03.pic.es
    set server operators = monami@pbs03.pic.es
    set server operators += root@pbs03.pic.es
    set server default_queue = glong_sl5
    set server log_events = 511
    set server mail_from = adm
    set server query_other_jobs = True
    set server scheduler_iteration = 600
    set server node_ping_rate = 300
    set server node_check_rate = 600
    set server tcp_timeout = 6
    set server default_node = slc5_x64
    set server node_pack = False
    set server job_stat_rate = 300
    set server mail_domain = never
    set server next_job_number = 18080874

    torque-server-2.5.6-0.cri.snap.201104041023.x86_64
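To put a rough number on the growth, the three `top` samples above can be parsed and compared. The following is only a sketch: it assumes the standard `top` column order shown in the header line, and it measures growth against the TIME+ column, which is accumulated CPU time rather than wall-clock time (the report gives no wall-clock timestamps). The helper names are illustrative and not part of Torque.

```python
import re

# Samples copied from the report. Column order, per the quoted header:
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
SAMPLES = """\
18506 root 15 0 151m 124m 740 S 35.6 3.2 0:05.24 pbs_server
18506 root 15 0 535m 503m 1220 S 25.9 12.7 4:23.37 pbs_server
18506 root 15 0 1021m 986m 1224 S 52.8 24.9 27:34.41 pbs_server
"""

# top prints sizes in KiB when there is no suffix.
_TO_MIB = {"": 1 / 1024, "k": 1 / 1024, "m": 1.0, "g": 1024.0}


def size_to_mib(field):
    """Convert a top size field such as '740', '124m' or '1.2g' to MiB."""
    match = re.fullmatch(r"([0-9.]+)([kmg]?)", field.lower())
    if match is None:
        raise ValueError(f"unrecognised size field: {field!r}")
    return float(match.group(1)) * _TO_MIB[match.group(2)]


def cpu_seconds(field):
    """Convert a TIME+ field of the form 'M:SS.hh' to seconds of CPU time."""
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + float(seconds)


def parse_sample(line):
    """Return (cpu_seconds, resident_mib) for one top process line."""
    cols = line.split()
    return cpu_seconds(cols[10]), size_to_mib(cols[5])


def growth_per_cpu_minute(samples):
    """Resident-memory growth in MiB per CPU minute, first to last sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    return (r1 - r0) / (t1 - t0) * 60


if __name__ == "__main__":
    parsed = [parse_sample(line) for line in SAMPLES.splitlines()]
    print(f"{growth_per_cpu_minute(parsed):.1f} MiB per CPU minute")
```

Logging wall-clock timestamps alongside each sample would give a more useful rate than CPU time does.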
Created an attachment (id=78): maps & smaps of pbs_server process
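An smaps dump like the attached one can be summarised to show which mappings hold the resident memory. This is a minimal sketch assuming the usual Linux `/proc/<pid>/smaps` layout (a mapping header line followed by `Key: value kB` lines); the function name and the `[anon]` label for unnamed mappings are illustrative choices.

```python
import re
from collections import defaultdict

# Mapping header: address range, perms, offset, device, inode, optional path.
HEADER = re.compile(
    r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$"
)
RSS = re.compile(r"^Rss:\s+(\d+)\s+kB$")


def rss_by_mapping(smaps_text):
    """Sum Rss (in kB) per mapping name, largest first.

    Unnamed mappings are grouped under the label '[anon]'.
    """
    totals = defaultdict(int)
    current = None
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1) or "[anon]"
            continue
        rss = RSS.match(line)
        if rss and current is not None:
            totals[current] += int(rss.group(1))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    import sys

    # Usage: python smaps_summary.py /proc/<pid>/smaps
    with open(sys.argv[1]) as handle:
        for name, kib in rss_by_mapping(handle.read())[:10]:
            print(f"{kib:>10} kB  {name}")
```

If most of the resident memory sits in `[heap]` or anonymous mappings and keeps growing between dumps, that is consistent with a heap leak, though it does not say which allocation site is responsible.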
Hi,

Any progress on this? We have the same problem with Torque 2.5.8 and Moab 6.1. As a workaround we use a cron script which restarts the server every hour.

Arnau, what scheduler software are you using? What is the size of your cluster (nodes/cores/average number of jobs)? We currently have 1k nodes and around 11k cores, with 5k jobs on average; core utilization is around 95%.

I am not able to debug the issue myself in production with valgrind because it slows things down too much. The bigger the cluster, the faster the problem occurs.

Best Regards
--
Lukasz Flis
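A variation on the hourly cron restart is to restart only once resident memory crosses a limit. The sketch below covers just the check, assuming the Linux `/proc/<pid>/status` format with its `VmRSS` line. The function names and the limit are hypothetical, and the restart command itself is site-specific and deliberately left out.

```python
import re


def rss_kib(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def should_restart(status_text, limit_kib):
    """True when the process's resident set exceeds limit_kib (hypothetical)."""
    rss = rss_kib(status_text)
    return rss is not None and rss > limit_kib


if __name__ == "__main__":
    import sys

    # Usage: python check_rss.py <pid> <limit_kib>
    # Exits with status 1 when a restart looks warranted, so a cron wrapper
    # can decide what to do next.
    pid, limit = sys.argv[1], int(sys.argv[2])
    with open(f"/proc/{pid}/status") as handle:
        sys.exit(1 if should_restart(handle.read(), limit) else 0)
```

Compared with an unconditional hourly restart, this restarts less often on quiet days, but it is still only a workaround for the leak.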
Hi,

we use MAUI on a cluster with 3200 job slots (315 nodes), and our cluster starts having problems when it has about 6.5k-7k jobs (idle+running).

Cheers,
Arnau
Arnau,

Do you have GPUs in your cluster? After applying the patch
https://github.com/adaptivecomputing/torque/commit/ab3c3dd9a4422155c31f29ad945a30f425c45ce8
the memory leaks are almost gone. Something is still leaking, however, but we haven't figured out what it is. With 25k jobs per day it takes 4 weeks to cross 1.5 GB, so I think it is a good result ;)
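The figures in this comment give a back-of-the-envelope estimate of the remaining leak per job. The sketch assumes that "GB" means GiB and that all 1.5 GB of growth is leaked memory. It ignores the server's baseline footprint, so the result is an upper bound, on the order of 2 KiB per job.

```python
JOBS_PER_DAY = 25_000          # from the comment
DAYS = 4 * 7                   # "4 weeks"
GROWTH_BYTES = 1.5 * 1024**3   # "1.5 GB", assumed to mean GiB


def total_jobs(jobs_per_day=JOBS_PER_DAY, days=DAYS):
    """Number of jobs processed over the period."""
    return jobs_per_day * days


def leak_per_job_bytes(growth_bytes=GROWTH_BYTES, jobs=None):
    """Upper-bound estimate of bytes leaked per job."""
    if jobs is None:
        jobs = total_jobs()
    return growth_bytes / jobs


if __name__ == "__main__":
    print(f"{total_jobs()} jobs, about {leak_per_job_bytes():.0f} bytes per job")
```

A leak of this size per job suggests a small per-job structure or string that is not freed, though the thread does not identify it.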