[torqueusers] instability in torque 2.3.1
josh at clusterresources.com
Mon Aug 11 08:07:52 MDT 2008
There were known problems with 2.3.1 regarding handshaking between the mom and server, log file
descriptors, and errors when Moab/Maui tried scheduling with neednodes, but I wasn't aware of any
reported segfaults or crashes.
As far as I've heard, other sites have been using 2.3.2 with much more success than 2.3.1. We
usually recommend that users upgrade to 2.3.2 if they are having trouble with 2.3.1 or if they are
considering moving from 2.1.x or 2.2.x to 2.3.x.
Miles O'Neal wrote:
> We are having horrible problems with torque 2.3.1.
> We ran 2.1.8 for a while and experienced problems
> with torque crashing or having issues talking to
> moms anywhere from once a week to once a month.
> We switched to 2.3.1 and started having the pbs_server
> crash anywhere from once every 2 days to several times
> a day. Often this is accompanied by many moms crashing
> as well. Sometimes this leaves orphaned jobs on clients,
> just to add to the confusion. Occasionally we have to
> stop maui for several seconds and restart it after
> restarting pbs_server, or they won't communicate.
> We tried upgrading to 2.3.2 but as Tom noted in another
> post to the list, it was not honoring requests based on
> memory size (e.g., -l mem=5Gb). Since this broke our
> job flow severely (you try running a 10GB job on a 2GB
> box and see how long it takes) we backed up to 2.3.1.
> Also in 2.3.1 we do not seem to be able to adjust a
> job's priority within a queue at qsub time. Didn't
> that work in the past?
> We have 500+ nodes, 37 queues, and a mix of jobs that
> run anywhere from days to a few minutes. A handful
> of the queues are routing queues. This all mostly
> worked in 2.1.8. All systems involved are running
> CentOS 4.4 x86_64.
> Anyone else running 2.3.x? How well is it working?
> Any suggestions?