[torqueusers] Server not talking to MOMs at all
Lennart Karlsson
Lennart.Karlsson at nsc.liu.se
Tue Sep 13 06:47:49 MDT 2005
Garrick,
You wrote:
> On Thu, Sep 08, 2005 at 10:52:00AM +0200, Lennart Karlsson alleged:
> > The state of Torque today, in our environments, is that it behaves
> > badly or crashes (e.g. the current favourite: eats all available physical
> > and swap memory) in one way or another very frequently (I estimate, without
> > measuring, that it is a factor of more than 100 times) compared to the
>
> That should really not be happening. My server host has 38 days uptime
> and pbs_server is using 15MB of ram. That's with 1700 nodes and 1200+
> jobs every day.
>
> Set 'PBSDEBUG=5' in your env and run pbs_server under gdb or valgrind.
> Please send any valgrind errors and/or a gdb backtrace of a crash.
On the same issue, Dave Jackson wrote:
> We are unaware of other crashes anywhere on any system. I would be
> very interested in determining what is unique about your policies,
> environment, workload, or resources. If you are running anything more
> recent than patch 5, please catch these failures under gdb 'where' or
> with valgrind. We will get them fixed immediately.
We were running torque_1.2.0p4, so I assume that you are not interested
in valgrind output from that version?
We have now upgraded to version 1.2.0p6 and will try to catch the problem
with valgrind if and when it comes back.
Our environment on the mentioned system? We have two PBS queues: one for
normal, high-priority jobs that run to completion within a user-defined
wallclock limit, and one for low-priority jobs that may be preempted
(PREEMPTIONPOLICY REQUEUE in Maui, now version maui-3.2.6p14-snap.1125680408)
whenever their nodes are needed by normal jobs.
This is a Linux cluster of 74 nodes plus one master node ("head node"?) for
one research department. The strategy is that each user "always" (with some
over-booking of resources, though) has quick access to a few compute nodes
as a handy compute resource when the need arises. Unused resources can be
filled with low-priority jobs in an opportunistic fashion. Bigger compute
demands are met on other clusters.
NFS workload can be fairly high (though not extremely high) on the master
node, which also runs Maui and Torque, and when the weekly full backup is
running as well we see a load average of between eight and ten. Typically
100-650 jobs a day are submitted with qsub, and the number of requeued jobs
varies widely, from none (weekends, vacation periods) to several thousand a
day. I think that the problems appear when we have a lot of requeues.
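One rough way to test the suspected link between requeues and memory growth would be to log the memory use of pbs_server at regular intervals and compare the timestamps with the requeue activity in the server logs. The following is only a minimal sketch, assuming a Linux master node where /proc/<pid>/status reports VmRSS and VmSize in kB; the function names and the sampling interval are hypothetical and not part of Torque.

```python
import re
import time
from pathlib import Path


def parse_vm_kb(status_text, field="VmRSS"):
    """Return the kB value of a /proc/<pid>/status field, or None if absent."""
    match = re.search(rf"^{field}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def sample(pid):
    """Read resident and virtual size (kB) of one process from /proc."""
    text = Path(f"/proc/{pid}/status").read_text()
    return parse_vm_kb(text, "VmRSS"), parse_vm_kb(text, "VmSize")


def log_memory(pid, interval_s=60, samples=5):
    """Print timestamped memory samples; interval and count are arbitrary."""
    for _ in range(samples):
        rss_kb, size_kb = sample(pid)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} rss_kb={rss_kb} size_kb={size_kb}")
        time.sleep(interval_s)


# Example (hypothetical PID): log_memory(12345, interval_s=300, samples=288)
```

A steadily rising rss_kb during periods with many requeues would support the suspicion, and would also indicate when it is worth attaching valgrind or gdb.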
Maui configuration, in short:
==============================================================================
SERVERHOST green
SERVERPORT 42559
SERVERMODE NORMAL
RMCFG[base] TYPE=PBS
RMCFG[base] TIMEOUT=90
AMCFG[base] TYPE=NONE
RMPOLLINTERVAL 00:02:00
RESDEPTH 120
DEFERTIME 0:30:00
DEFERCOUNT 5000
QUEUETIMEWEIGHT 0
XFACTORWEIGHT 1
QOSWEIGHT 1
BACKFILLPOLICY BESTFIT
RESERVATIONPOLICY CURRENTHIGHEST
RESERVATIONDEPTH 300
RESERVATIONRETRYTIME 0:10:00
JOBPRIOACCRUALPOLICY FULLPOLICY
NODEALLOCATIONPOLICY MINRESOURCE
JOBNODEMATCHPOLICY EXACTNODE
JOBMAXSTARTTIME 0:10:00
JOBMAXOVERRUN 0:05:00
NODEACCESSPOLICY SINGLEJOB
SRCFG[interakt] STARTTIME=2:00:00 ENDTIME=23:59:59
SRCFG[interakt] PERIOD=DAY DAYS=MON,TUE,WED,THU,FRI DEPTH=7
SRCFG[interakt] TASKCOUNT=4
SRCFG[interakt] TIMELIMIT=6:00:05
SRCFG[interakt] CLASSLIST=riskjobb
PREEMPTIONPOLICY REQUEUE
CLASSCFG[riskjobb] QDEF=Risk
CLASSCFG[workq] QDEF=Normal
SYSCFG PLIST=DEFAULT QDEF=Disable
QOSCFG[Disable] OMAXPS=1
QOSCFG[Risk] PRIORITY=1 XFWEIGHT=1 QTWEIGHT=1 QFLAGS=PREEMPTEE,IGNALL OMAXIJOB=2
QOSCFG[Normal] PRIORITY=100000 XFWEIGHT=1000 QFLAGS=PREEMPTOR
QOSCFG[High] PRIORITY=10000000 XFWEIGHT=1000 QFLAGS=PREEMPTOR
USERCFG[DEFAULT] MAXPS=5184000
GROUPCFG[nsc] MAXIJOB=10 MAXPROC=36
USERCFG[anders] MAXIJOB=10 MAXPROC=8 MAXPS=5184000
USERCFG[svetlana] MAXIJOB=10 MAXPROC=8 MAXPS=5184000
# and so on with the other users
==============================================================================
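To put the MAXPS values above in perspective: assuming MAXPS is counted in processor-seconds, as the Maui documentation describes it, 5184000 corresponds to 60 processor-days. The small sketch below only illustrates that arithmetic; the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def ps_to_proc_days(proc_seconds):
    """Convert processor-seconds to processor-days."""
    return proc_seconds / SECONDS_PER_DAY


def walltime_days_at(proc_seconds, procs):
    """Days of walltime the budget covers when spread over `procs` processors."""
    return proc_seconds / procs / SECONDS_PER_DAY


print(ps_to_proc_days(5184000))      # 60.0 processor-days
print(walltime_days_at(5184000, 8))  # 7.5 days on 8 processors (MAXPROC=8)
```

In other words, a user at the MAXPROC=8 limit could have about 7.5 days of requested walltime outstanding before reaching the MAXPS limit.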
PBS server and mom configurations:
==============================================================================
create queue workq
set queue workq queue_type = Execution
set queue workq enabled = True
set queue workq started = True
create queue riskjobb
set queue riskjobb queue_type = Execution
set queue riskjobb enabled = True
set queue riskjobb started = True
set server scheduling = True
set server default_queue = workq
set server log_events = 255
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.walltime = 00:00:01
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server job_stat_rate = 30
==============================================================================
$logevent 127
$prologalarm 120
$clienthost pbsserver
$usecp *:/home /home
==============================================================================
We run a prologue to remove temporary files and an epilogue to remove processes.
Best regards,
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
National Supercomputer Centre in Linkoping, Sweden
http://www.nsc.liu.se