Bug 144 - Possible memory leak in pbs_server

Status: NEW
Product: TORQUE
Version: 2.5.x
Hardware: PC Linux
Importance: P5 major
Assigned To: David Beer
Reported: 2011-07-07 04:48 MDT by Arnau
Modified: 2013-02-13 17:24 MST


maps & smaps of pbs_server process (137 bytes, text/tgz)
2011-07-07 09:18 MDT, Arnau



Description Arnau 2011-07-07 04:48:03 MDT
In our cluster, pbs_server starts out using a small amount of memory:

PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND                 
18506 root      15   0  151m 124m  740 S 35.6  3.2   0:05.24 pbs_server 

but after a few minutes it uses more and more memory:

18506 root      15   0  535m 503m 1220 S 25.9 12.7   4:23.37 pbs_server 
18506 root      15   0 1021m 986m 1224 S 52.8 24.9  27:34.41 pbs_server  

and after a few hours it uses ALL the available memory and the host starts
swapping; jobs cannot start because the server cannot fork:

pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed 

Our pbs_server configuration (qmgr output):

set server scheduling = False
set server acl_host_enable = False
set server acl_hosts = pbs03.pic.es
set server managers = XXXXXXXX
set server managers += monami@pbs03.pic.es
set server managers += root@pbs03.pic.es
set server operators = monami@pbs03.pic.es
set server operators += root@pbs03.pic.es
set server default_queue = glong_sl5
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server default_node = slc5_x64
set server node_pack = False
set server job_stat_rate = 300
set server mail_domain = never
set server next_job_number = 18080874

Comment 1 Arnau 2011-07-07 09:18:32 MDT
Created attachment 78: maps & smaps of pbs_server process
Comment 2 Lukasz Flis 2011-10-21 14:21:28 MDT

Any progress on this?

We have the same problem with Torque 2.5.8 and Moab 6.1.
As a workaround we use a cron script that restarts the server every hour.
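A restart workaround of that kind might look like the following cron entry (a sketch only; the init-script path varies by distribution and is an assumption):

```shell
# /etc/cron.d/restart_pbs_server -- hourly pbs_server restart (workaround, not a fix).
# Adjust the init-script path for your distribution.
0 * * * * root /etc/init.d/pbs_server restart >/dev/null 2>&1
```

Note that restarting pbs_server requeues or reconnects running jobs depending on configuration, so an off-peak schedule may be preferable.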

Arnau, what scheduler software are you using? What is the size of your cluster
(nodes/cores/average number of jobs)?

We currently have 1k nodes and around 11k cores, with 5k jobs on average;
core utilization is around 95%.

I am not able to debug the issue myself on production with valgrind because it
slows things down too much. The bigger the cluster, the faster the problem occurs.

Best Regards
Lukasz Flis
Comment 3 Arnau 2011-10-24 07:26:59 MDT

We use Maui with 3200 job slots (315 nodes), and our cluster starts having
problems when it has about 6.5-7k jobs.

Comment 4 Lukasz Flis 2013-02-13 17:24:05 MST

Do you have GPUs in your cluster?
After applying the patch, the memory leaks are almost gone.

Something, however, is still leaking, but we haven't figured out what it is.
With 25k jobs per day it takes 4 weeks to cross 1.5 GB, so I think that is a
good result ;)