[torqueusers] pbs_server not keeping up

Gus Correa gus at ldeo.columbia.edu
Wed Aug 29 10:07:28 MDT 2012


On 08/29/2012 11:35 AM, Tony Schreiner wrote:
> On Aug 29, 2012, at 10:38 AM, Tony Schreiner wrote:
>
>> This is on my smallish cluster running torque 2.5.7.
>>
>> A user submitted about 8000 jobs to a routing queue, which feeds to an execution queue with 200 runnable slots.
>>
>> At the moment, pbs_server is unable to handle it: pbsnodes returns no nodes found, and qstat -q takes a long time and shows nothing.
>> This is the tail of the latest server_logs file:
>>
>> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux24 not detected in 1346250363 seconds, marking node down
>> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux25 not detected in 1346250363 seconds, marking node down
>> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux27 not detected in 1346250363 seconds, marking node down
>> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux29 not detected in 1346250363 seconds, marking node down
>> 08/29/2012 10:26:40;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=51)
>> 08/29/2012 10:27:18;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=51)
>> 08/29/2012 10:27:18;000d;PBS_Server;Job;797357.portal;Post job file processing error; job 797357.portal on host linux29/1
>> 08/29/2012 10:27:18;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=53)
>> 08/29/2012 10:28:33;0004;PBS_Server;Svr;check_nodes;node linux20 not detected in 1346250513 seconds, marking node down
>>
>> Here are some server settings:
>>
>> set server log_events = 511
>> set server mail_from = adm
>> set server query_other_jobs = True
>> set server resources_default.ncpus = 1
>> set server resources_default.nodect = 1
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 6
>> set server mom_job_sync = True
>> set server keep_completed = 86400
>> set server log_keep_days = 365
>> set server next_job_number = 808497
>> set server job_log_keep_days = 365
>>
>>
>> Is there anything I can change to help move things along?
>> Thanks
>>
>> Tony Schreiner
> Addendum: the problem seems to have more to do with the number of entries in the server_priv/jobs directory. There were about 50,000 in there. When I deleted the older ones (about half), operation returned to normal. I'm going to reduce keep_completed, at least temporarily.
>
> Tony
>
>
Hi Tony

Have you tried setting the max_queuable or max_user_queuable attribute
of your execution queue?
http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php#attributes

I guess this will throttle the routing-to-execution queue job flux, and 
reduce the clutter.
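
Something along these lines is what I have in mind. It is only a sketch,
not tested on your setup: the queue name "batch" and all the numbers are
made up, so substitute your own, and check the attribute names against
the documentation for your Torque version. The last command is the
keep_completed reduction you mentioned, with an arbitrary value of one
hour.

```
# "batch" is a hypothetical execution queue name; the limits are illustrative.
# Cap the total number of jobs the execution queue will accept.
qmgr -c "set queue batch max_queuable = 1000"
# Cap the number of jobs any single user may have in that queue.
qmgr -c "set queue batch max_user_queuable = 500"
# Keep completed jobs for one hour instead of 86400 seconds (value is an assumption).
qmgr -c "set server keep_completed = 3600"
```

Afterwards, qmgr -c "list queue batch" should show the new limits on the
queue.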

Gus Correa

