[torqueusers] Understanding & dealing with torque error codes

D.J.Baker at soton.ac.uk D.J.Baker at soton.ac.uk
Fri Oct 29 10:07:28 MDT 2004


David,
     Thank you again for your consideration of these issues. We're finding
that the most frequent errors appear to be 15001, 15004 and 15041. It may
be useful for us to send our logs to you -- at the moment that is difficult
to sort out since I'm at home, and so remote from the nodes. Thanks for the
offer to investigate the problems further.
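
For anyone wanting to check which codes dominate before sending logs, here
is a minimal sketch in Python. It assumes the codes appear in the log text
as bare five-digit numbers starting with 15; the function name is
illustrative only, and the actual log line layout may differ between
versions.

```python
import re
from collections import Counter

# Assumption: error codes show up in the log text as bare numbers 15000-15999.
# Caveat: as far as I know the default daemon ports also fall in the
# 15001-15004 range, so a bare match may be a port number, not an error code.
PBS_CODE = re.compile(r"\b(15\d{3})\b")


def tally_error_codes(lines):
    """Count occurrences of each 15xxx number across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(int(code) for code in PBS_CODE.findall(line))
    return counts
```

Passing an open log file to tally_error_codes() and printing the result of
.most_common() gives a rough ranking. Treat the counts as an upper bound
because of the port-number caveat noted in the comment.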


On the subject of prevention, could I please put the following questions to
you, and to torque-users:

I see that errors 15001 and 15004 appear to be cleared following a restart
of the mom. My initial thought was to keep the slate as clear as possible
by restarting the mom after each job. Does this make sense, and could doing
so cause any issues in the system? The error appears to occur at random,
but annoyingly it does seem to cause jobs to queue up at the problem node.

Error 15041 is very problematic. A closer investigation shows that multiple
instances of a job appear to run on the system. Briefly, jobA tries to run on
nodeX. nodeX, however, reports 15041 and the job is sent elsewhere. The job
then starts to run on nodeY. Unfortunately, despite the rejection of the job
on nodeX, it does in fact run on nodeX as well. To address this issue, would
you recommend upgrading to 1.1.0pX, where X is either 3 or 4?
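
The duplicate-run symptom described above can be checked for mechanically.
This is a minimal sketch, assuming a mapping from node name to the job ids
seen running there has already been collected (how it is collected is
site-specific and not shown). The function and parameter names are
hypothetical. Multi-node jobs legitimately appear on several nodes, so
expected_nodes should be raised for those.

```python
from collections import defaultdict


def find_duplicated_jobs(jobs_by_node, expected_nodes=1):
    """Return {job_id: sorted node list} for jobs seen on more nodes than expected.

    jobs_by_node maps a node name to the job ids observed running there.
    """
    nodes_by_job = defaultdict(set)
    for node, job_ids in jobs_by_node.items():
        for job_id in job_ids:
            nodes_by_job[job_id].add(node)
    # Keep only the jobs that appear on too many distinct nodes.
    return {
        job_id: sorted(nodes)
        for job_id, nodes in nodes_by_job.items()
        if len(nodes) > expected_nodes
    }
```

With {"nodeX": ["jobA"], "nodeY": ["jobA"]} as input, the function reports
jobA against both nodeX and nodeY, which is the situation described above.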

Finally, would an upgrade help to eliminate errors 15001 and 15004? What is
the theory, and what is the experience of the community, please?
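
To turn the numeric codes into symbolic names, the pbs_error.h header
mentioned in the quoted reply can be scanned. This is a minimal sketch,
assuming each code is declared on one line as "#define PBSE_NAME number",
optionally followed by a C comment. That layout is an assumption and may
not hold for every version of the header; the function name is illustrative.

```python
import re

# Assumption: codes are declared as "#define PBSE_NAME 15nnn /* comment */".
DEFINE = re.compile(
    r"^\s*#\s*define\s+(PBSE_\w+)\s+(\d+)\s*(?:/\*\s*(.*?)\s*\*/)?"
)


def parse_pbs_errors(header_text):
    """Map numeric code -> (symbol, comment) for every PBSE_ define found."""
    table = {}
    for line in header_text.splitlines():
        match = DEFINE.match(line)
        if match:
            # A missing trailing comment is stored as an empty string.
            table[int(match.group(2))] = (match.group(1), match.group(3) or "")
    return table
```

Reading the header from wherever it is installed and calling
parse_pbs_errors(text).get(15004) returns the symbol and comment for that
code, or None if the header does not use the assumed layout.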

Thank you again -- David.


Quoting Dave Jackson <jacksond at supercluster.org>:

> David,
>
>   15004 failures indicate 'invalid values' being passed into some
> request.  They can occur with job holds, job dependencies, job
> submissions from invalid clients, attempts to start jobs in a routing
> queue, etc.
>
>   The latest torque 1.1.0p4 snapshot has logging enhancements which
> record the core reason for most of these failures.  Also, regarding
> failure code lookup, this can be found at
> http://clusterresources.com/torquedocs/2.1debugging.shtml or in
> pbs_error.h
>
>   If you could send us your pbs_server and/or pbs_mom logs, we can
> assist you further.  This would be most useful if the daemons were
> started with PBSLOGLEVEL set to 3 or higher.  Please include annotation
> describing what activity was occurring when the failure took place.
>
> Thanks,
> Dave
>
> On Thu, 2004-10-28 at 06:07, David Baker wrote:
> > Hi,
> >     We are currently setting up a medium sized (160 nodes) cluster based on
> > torque (1.1.0p0), and maui (3.2.6p7). We are finding that the node moms
> > report various error codes, and that we can not find any documentation or
> > help on dealing with these conditions. The most problematic error is
> > 15004 -- the mom appears to be in a state of confusion, and rejects jobs
> > until the mom is restarted. Does anyone out there have an automated
> > procedure for preventing and/or dealing with this issue, please?
> >
> > Other error conditions we have seen are 15001, 15009 and 15029. In general
> > terms does supercluster or any users/group have access to any documentation
> > that might enable us to understand and control these conditions, please?
> >
> > Your advice and comments would be appreciated, please.
> > Thank you -- David Baker.
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://supercluster.org/mailman/listinfo/torqueusers




