[torqueusers] Understanding & dealing with torque error codes
jacksond at supercluster.org
Fri Oct 29 11:18:16 MDT 2004
Let's focus on your issues one at a time. The 15041 (PBSE_MOMREJECT)
issue can occur in several situations as a job attempts to start.
TORQUE 1.1.0 p3 and p4 both support verbose logging to identify exactly
what is causing this issue. The first step should be to get one of
these versions in place, and then start pbs_server with a LOGLEVEL of 3
or higher (to activate the verbose logging). To do this, export
PBSLOGLEVEL=3 before starting pbs_server. You should then have
information about what failure is actually occurring, e.g. 'cannot stage
input file', in the pbs_server logs. You will then want to examine
the compute host's pbs_mom daemon logs. Again, set the loglevel to 3
before starting pbs_mom. The mom logs should indicate the low-level
failure. P4 has the best logging on the mom side.
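As a rough sketch of the steps above (not an official startup procedure), the log level can be set in the environment of a single daemon invocation. The helper name start_verbose is made up for illustration, and the sketch assumes pbs_server and pbs_mom are on the PATH and are started by hand rather than from an init script:

```shell
# Hypothetical helper: run a TORQUE daemon with PBSLOGLEVEL set in its
# environment. The variable name PBSLOGLEVEL and the value 3 come from the
# message above; the function name and its default are illustrative only.
start_verbose() {
    daemon="$1"          # e.g. pbs_server or pbs_mom
    level="${2:-3}"      # 3 or higher activates the verbose logging
    env PBSLOGLEVEL="$level" "$daemon"
}

# Usage (on the server host, then on each compute host):
#   start_verbose pbs_server
#   start_verbose pbs_mom
```

Using env keeps the setting scoped to that one process, so the calling shell's environment is left unchanged.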
If this does not identify the system level issue for you, send us the
logs and we will see what we can do.
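Before sending logs, it may help to pull out just the lines that mention the codes in question. The following is a minimal sketch only: the function name is hypothetical, and the log paths in the usage comment are assumed defaults that differ between installs. It makes no assumption about the log line format and simply matches each numeric code as a whole word.

```shell
# Hypothetical helper: print lines (with line numbers) that mention any of
# the given numeric error codes in a log file.
find_error_lines() {
    logfile="$1"
    shift
    # Build an alternation such as 15001|15004|15041 from the remaining args.
    pattern=$(printf '%s|' "$@")
    pattern="${pattern%|}"
    # -w avoids matching a code embedded in a longer number.
    grep -nwE "$pattern" "$logfile"
}

# Usage (paths are assumptions; adjust to your PBS home directory):
#   find_error_lines /var/spool/torque/server_logs/20041029 15001 15004 15041
#   find_error_lines /var/spool/torque/mom_logs/20041029 15041
```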
On Fri, 2004-10-29 at 10:07, D.J.Baker at soton.ac.uk wrote:
> Thank you again for your consideration of these issues. We're finding
> that the most frequent errors appear to be 15001, 15004 and 15041. It may
> be useful for us to send our logs to you -- at the moment that is difficult
> to sort out since I'm at home, and so remote from the nodes. Thanks for the
> offer to investigate the problems further.
> On the level of prevention could I please broach the following questions to
> you, and torque-users...
> I see that errors 15001, and 15004 appear to be cleared following a restart
> of the mom. My initial thought was to try to keep the slate as clear as
> possible by perhaps restarting the mom following each job. Does this make
> any valid sense, and would doing such a thing potentially cause any issues
> in the system? It's interesting that the error appears to be random;
> annoyingly, however, it does seem to cause jobs to queue up at the problem node.
> Error 15041 is very problematic. A closer investigation shows that multiple
> instances of jobs appear to run on the system. Briefly, jobA tries to run on
> nodeX. nodeX, however, reports 15041 and the job is sent elsewhere. The job
> then starts to run on nodeY. Unfortunately, despite the rejection of the job
> on nodeX, it does in fact run on nodeX as well... With respect to
> addressing this issue would you recommend upgrading to 1.1.0pX where X is
> either 3 or 4?
> Finally, would an upgrade help to eliminate errors 15001 and 15004? What is
> the theory and the experience of the community, please?
> Thank you again -- David.
> Quoting Dave Jackson <jacksond at supercluster.org>:
> > David,
> > 15004 failures indicate 'invalid values' being passed into some
> > request. They can occur with job holds, job dependencies, job
> > submissions from invalid clients, attempts to start jobs in a routing
> > queue, etc.
> > The latest torque 1.1.0p4 snapshot has logging enhancements which
> > record the core reason for most of these failures. Also, regarding
> > failure code lookup, this can be found at
> > http://clusterresources.com/torquedocs/2.1debugging.shtml or in
> > pbs_error.h
> > If you could send us your pbs_server and/or pbs_mom logs, we can
> > assist you further. This would be most useful if the daemons were
> > started with PBSLOGLEVEL set to 3 or higher. Please include annotation
> > describing what activity was occurring when the failure took place.
> > Thanks,
> > Dave
> > On Thu, 2004-10-28 at 06:07, David Baker wrote:
> > > Hi,
> > > We are currently setting up a medium-sized (160 nodes) cluster based on
> > > torque (1.1.0p0), and maui (3.2.6p7). We are finding that the node moms
> > > report various error codes, and that we cannot find any documentation or
> > > help on dealing with these conditions. The most problematic error is
> > > 15004 -- the mom appears to be in a state of confusion, and rejects jobs
> > > until the mom is restarted. Does anyone out there have an automated
> > > procedure for preventing and/or dealing with this issue, please?
> > >
> > > Other error conditions we have seen are 15001, 15009 and 15029. In general
> > > terms does supercluster or any users/group have access to any documentation
> > > that might enable us to understand and control these conditions, please?
> > >
> > > Your advice and comments would be appreciated, please.
> > > Thank you -- David Baker.
> > >
> > > _______________________________________________
> > > torqueusers mailing list
> > > torqueusers at supercluster.org
> > > http://supercluster.org/mailman/listinfo/torqueusers