[torqueusers] Understanding & dealing with torque error codes

Dave Jackson jacksond at supercluster.org
Thu Oct 28 10:51:43 MDT 2004


David,

  15004 failures indicate 'invalid values' being passed into some
request.  They can occur with job holds, job dependencies, job
submissions from invalid clients, attempts to start jobs in a routing
queue, etc.

  The latest torque 1.1.0p4 snapshot has logging enhancements which
record the core reason for most of these failures.  Also, for failure
code lookup, the codes are listed at
http://clusterresources.com/torquedocs/2.1debugging.shtml and in
pbs_error.h.
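
  As a rough sketch of the pbs_error.h lookup, the following Python
builds a code-to-name table from the header text.  It assumes the header
defines each code on one line in the form
"#define PBSE_NAME 15004 /* comment */"; that layout, and the helper
names load_pbs_errors and describe, are assumptions for illustration, so
check them against your own copy of the header.

```python
import re

# Assumed line layout (verify against your pbs_error.h):
#   #define PBSE_NAME 15004 /* optional comment */
DEFINE_RE = re.compile(
    r'^\s*#\s*define\s+(PBSE_\w+)\s+(\d+)\s*(?:/\*\s*(.*?)\s*\*/)?'
)

def load_pbs_errors(header_text):
    """Return {numeric code: (symbol, comment)} parsed from header text."""
    table = {}
    for line in header_text.splitlines():
        m = DEFINE_RE.match(line)
        if m:
            name, code, comment = m.groups()
            table[int(code)] = (name, comment or "")
    return table

def describe(table, code):
    """Format one code for display; unknown codes are reported as such."""
    name, comment = table.get(code, ("unknown", ""))
    return f"{code} {name} {comment}".strip()
```

  To use it, read pbs_error.h from wherever your torque source tree
keeps it, pass the text to load_pbs_errors(), and call describe() for
each code the moms report.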

  If you could send us your pbs_server and/or pbs_mom logs, we can
assist you further.  This would be most useful if the daemons were
started with PBSLOGLEVEL set to 3 or higher.  Please include a note
describing what activity was occurring when the failure took place.
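
  When preparing logs to send, a small script can help find the lines
worth annotating.  This is only a sketch: it makes no assumption about
the log line format and simply collects any line containing a
five-digit number starting with 15, so port numbers in the same range
will also match and the hits need a human look.  The function name
lines_by_code is hypothetical.

```python
import re
from collections import defaultdict

# Any standalone five-digit number of the form 15xxx (e.g. 15004).
CODE_RE = re.compile(r'\b(15\d{3})\b')

def lines_by_code(lines):
    """Group (line number, text) pairs by each 15xxx code they mention."""
    hits = defaultdict(list)
    for number, line in enumerate(lines, 1):
        # set() so a code repeated on one line is recorded once
        for code in set(CODE_RE.findall(line)):
            hits[int(code)].append((number, line.rstrip("\n")))
    return dict(hits)
```

  Pass it an open log file (or any iterable of lines) and add your
notes next to the reported line numbers before sending the log.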

Thanks,
Dave

On Thu, 2004-10-28 at 06:07, David Baker wrote:
> Hi,
>     We are currently setting up a medium-sized (160 nodes) cluster based on
> torque (1.1.0p0), and maui (3.2.6p7). We are finding that the node moms
> report various error codes, and that we cannot find any documentation or
> help on dealing with these conditions. The most problematic error is
> 15004 -- the mom appears to be in a state of confusion, and rejects jobs
> until the mom is restarted. Does anyone out there have an automated
> procedure for preventing and/or dealing with this issue, please?
> 
> Other error conditions we have seen are 15001, 15009 and 15029. In general
> terms does supercluster or any users/group have access to any documentation
> that might enable us to understand and control these conditions, please?
> 
> Your advice and comments would be appreciated, please.
> Thank you -- David Baker.
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers


