[torqueusers] instability in torque 2.3.1

Miles O'Neal meo at intrinsity.com
Sun Aug 10 13:09:06 MDT 2008


We are having horrible problems with torque 2.3.1.
We ran 2.1.8 for a while and experienced problems
with torque crashing or having issues talking to
moms anywhere from once a week to once a month.

We switched to 2.3.1 and started having the pbs_server
crash anywhere from once every 2 days to several times
a day.  Often this is accompanied by many moms crashing
as well.  Sometimes this leaves orphaned jobs on clients,
just to add to the confusion.  Occasionally we have to
stop maui for several seconds and restart it after
restarting pbs_server, or they won't communicate.

We tried upgrading to 2.3.2 but as Tom noted in another
post to the list, it was not honoring requests based on
memory size (e.g., -l mem=5Gb).  Since this broke our
job flow severely (you try running a 10GB job on a 2GB
box and see how long it takes) we backed up to 2.3.1.

Also in 2.3.1 we do not seem to be able to adjust a
job's priority within a queue at qsub time.  Didn't
that work in the past?

We have 500+ nodes, 37 queues, and a mix of jobs that
run anywhere from days to a few minutes.  A handful
of the queues are routing queues.  This all mostly
worked in 2.1.8.  All systems involved are running
CentOS 4.4 x86_64.

Anyone else running 2.3.x?  How well is it working?
Any suggestions?

