[torqueusers] Understanding & dealing with torque error codes

Jeffrey J. Barteet barteet at mrl.ucsb.edu
Thu Oct 28 10:52:45 MDT 2004


Hey, David,

I'm using a distribution of torque from last winter when it was briefly called
spbs.

In my source distribution, the file:

/spbs-1.0.1/src/include/pbs_error.h

contains the definitions of the error codes you see in log files, and I've found
it valuable in debugging issues like this.

This portion of the file defines the errors you were seeing:

#define PBSE_NONE       0               /* no error */
#define PBSE_UNKJOBID   15001           /* Unknown Job Identifier */
#define PBSE_NOATTR     15002           /* Undefined Attribute */
#define PBSE_ATTRRO     15003           /* attempt to set READ ONLY attribute */
#define PBSE_IVALREQ    15004           /* Invalid request */
#define PBSE_UNKREQ     15005           /* Unknown batch request */
#define PBSE_TOOMANY    15006           /* Too many submit retries */
#define PBSE_PERM       15007           /* No permission */
#define PBSE_BADHOST    15008           /* access from host not allowed */
#define PBSE_JOBEXIST   15009           /* job already exists */
#define PBSE_SYSTEM     15010           /* system error occurred */
#define PBSE_INTERNAL   15011           /* internal server error occurred */
#define PBSE_REGROUTE   15012           /* parent job of dependent in rte que */
#define PBSE_UNKSIG     15013           /* unknown signal name */
#define PBSE_BADATVAL   15014           /* bad attribute value */
#define PBSE_MODATRRUN  15015           /* Cannot modify attrib in run state */
#define PBSE_BADSTATE   15016           /* request invalid for job state */
#define PBSE_UNKQUE     15018           /* Unknown queue name */
#define PBSE_BADCRED    15019           /* Invalid Credential in request */
#define PBSE_EXPIRED    15020           /* Expired Credential in request */
#define PBSE_QUNOENB    15021           /* Queue not enabled */
#define PBSE_QACESS     15022           /* No access permission for queue */
#define PBSE_BADUSER    15023           /* Bad user - no password entry */
#define PBSE_HOPCOUNT   15024           /* Max hop count exceeded */
#define PBSE_QUEEXIST   15025           /* Queue already exists */
#define PBSE_ATTRTYPE   15026           /* incompatable queue attribute type */
#define PBSE_QUEBUSY    15027           /* Queue Busy (not empty) */
#define PBSE_QUENBIG    15028           /* Queue name too long */
#define PBSE_NOSUP      15029           /* Feature/function not supported */
#define PBSE_QUENOEN    15030           /* Cannot enable queue,needs add def */
#define PBSE_PROTOCOL   15031           /* Protocol (ASN.1) error */
#define PBSE_BADATLST   15032           /* Bad attribute list structure */
#define PBSE_NOCONNECTS 15033           /* No free connections */
#define PBSE_NOSERVER   15034           /* No server to connect to */
#define PBSE_UNKRESC    15035           /* Unknown resource */
#define PBSE_EXCQRESC   15036           /* Job exceeds Queue resource limits */
#define PBSE_QUENODFLT  15037           /* No Default Queue Defined */
#define PBSE_NORERUN    15038           /* Job Not Rerunnable */
#define PBSE_ROUTEREJ   15039           /* Route rejected by all destinations */
#define PBSE_ROUTEEXPD  15040           /* Time in Route Queue Expired */
#define PBSE_MOMREJECT  15041           /* Request to MOM failed */
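
If you want to translate codes from the logs without opening the header each
time, a small lookup table is one way to do it. This is only a rough sketch, not
anything shipped with torque: the names pbse_entry, pbse_table and pbse_lookup
are my own, and the table covers just the four codes you mentioned, with the
values and comments copied from the listing above.

```c
#include <stddef.h>

/* Values copied from the pbs_error.h excerpt above (subset only). */
#define PBSE_UNKJOBID   15001
#define PBSE_IVALREQ    15004
#define PBSE_JOBEXIST   15009
#define PBSE_NOSUP      15029

/* Hypothetical record type: one row per error code. */
struct pbse_entry {
    int         code;   /* numeric code as it appears in the logs */
    const char *name;   /* macro name from pbs_error.h */
    const char *text;   /* comment text from pbs_error.h */
};

static const struct pbse_entry pbse_table[] = {
    { PBSE_UNKJOBID, "PBSE_UNKJOBID", "Unknown Job Identifier" },
    { PBSE_IVALREQ,  "PBSE_IVALREQ",  "Invalid request" },
    { PBSE_JOBEXIST, "PBSE_JOBEXIST", "job already exists" },
    { PBSE_NOSUP,    "PBSE_NOSUP",    "Feature/function not supported" },
};

/* Hypothetical helper: return the table row for a code, or NULL if the
 * code is not in this (deliberately partial) table. */
static const struct pbse_entry *pbse_lookup(int code)
{
    size_t i;

    for (i = 0; i < sizeof pbse_table / sizeof pbse_table[0]; i++) {
        if (pbse_table[i].code == code)
            return &pbse_table[i];
    }
    return NULL;
}
```

Calling pbse_lookup(15004) returns the PBSE_IVALREQ row; a code that is not in
the table returns NULL, so the caller should check for that before printing.
Extending the table to the full list is just a matter of adding rows.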

Good luck,

-jeffrey


Quoting David Baker <D.J.Baker at soton.ac.uk>:

> Hi,
>     We are currently setting up a medium sized (160 nodes) cluster based on
> torque (1.1.0p0), and maui (3.2.6p7). We are finding that the node moms
> report various error codes, and that we cannot find any documentation or
> help on dealing with these conditions. The most problematic error is
> 15004 -- the mom appears to be in a state of confusion, and rejects jobs
> until the mom is restarted. Does anyone out there have an automated
> procedure for preventing and/or dealing with this issue, please?
>
> Other error conditions we have seen are 15001, 15009 and 15029. In general
> terms does supercluster or any users/group have access to any documentation
> that might enable us to understand and control these conditions, please?
>
> Your advice and comments would be appreciated, please.
> Thank you -- David Baker.


Jeffrey J. Barteet
Systems Administrator
Materials Research Laboratory
University of California, Santa Barbara

