[torqueusers] Understanding & dealing with torque error codes
D.J.Baker at soton.ac.uk
Thu Oct 28 06:07:50 MDT 2004
We are currently setting up a medium-sized (160-node) cluster based on
torque (1.1.0p0) and maui (3.2.6p7). We are finding that the node moms
report various error codes, and we cannot find any documentation or
help on dealing with these conditions. The most problematic error is
15004 -- the mom appears to get into a confused state and rejects jobs
until it is restarted. Does anyone out there have an automated
procedure for preventing and/or dealing with this issue?
Other error codes we have seen are 15001, 15009 and 15029. More generally,
does supercluster or any user group have access to documentation
that might enable us to understand and control these conditions?
Your advice and comments would be appreciated.
Thank you -- David Baker.