Deployment Troubleshooting

The Moab Cluster Deployment Wizard does not start up automatically after the head node installation.

  • Click the Moab Cluster Deployment Wizard icon on the desktop to start the wizard. If the icon does not exist, the head node installation did not complete successfully.
  • Try reinstalling the head node.
  • If reinstalling the head node does not work, run the wizard manually with the command found at /usr/share/mcb/mcb.

I accidentally configured the incorrect number of compute nodes and/or racks in the Moab Cluster Deployment Wizard.

  • Run the wizard again specifying the correct numbers. The wizard will reconfigure your cluster.

The Moab Cluster Deployment Wizard's advanced layout section is disabled.

  • Three values are required to enable the advanced layout section: (1) the network interface, (2) the number of racks, and (3) the number of nodes for each rack.

The Moab Cluster Deployment Wizard's cluster configuration fails.

  • The /var/log/mcb/fatal file may give information that explains the problem.
  • If one or more services did not start during configuration, look at the logs found at /var/log/mcb/services_logs.
  • See the Logging section for more details.

The Moab Cluster Deployment Wizard's compute node booting fails.

  • Check to make sure network cables are plugged in.
  • Check your firewall settings and atftpd configuration. Refer to the SUSE manual for more information on how to do this.
  • Make sure that the compute node is set to boot from network (PXE) in BIOS, or that you are manually overriding the boot order to boot from the network.
  • Make sure the correct head node network interface was configured for communicating with the compute nodes. For example, booting fails if 'eth0' was configured in the wizard but the head node actually uses its 'eth1' interface to communicate with the compute nodes.
  • If you cannot use PXE for any reason, there is a tool called Etherboot/gPXE that will allow you to boot from a CD/floppy and then load the network booter. This tool can be found at http://www.etherboot.org.
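
As a first check on the interface item above, the following minimal Python sketch tests whether the interface name entered in the wizard exists on the head node at all. It is an illustration only and is not part of the wizard; the function names are hypothetical. An interface that exists is not necessarily the one cabled to the compute nodes, so this only rules out a mistyped or missing name.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces known to this host."""
    return [name for _index, name in socket.if_nameindex()]


def check_interface(configured, available):
    """Report whether the interface configured in the wizard exists.

    `configured` is the name entered in the wizard (for example 'eth0');
    `available` is a list of interface names, such as list_interfaces().
    """
    if configured in available:
        return f"{configured}: present on this host"
    names = ", ".join(sorted(available))
    return f"{configured}: not found; available interfaces: {names}"
```

On the head node, check_interface("eth0", list_interfaces()) gives a one-line answer for the interface name used in the example above.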

The Moab Cluster Deployment Wizard's diagnostics fail.

  • Make sure the scheduler's queue(s) are empty. For the tests to succeed, jobs cannot be running at test time.
  • Check Moab's queue(s) after the diagnostics have finished via the showq command. If there are any jobs remaining in the queue(s), see why they failed or are still running. This can be accomplished by issuing the command checkjob -v <job_id>.
  • Make sure that Moab is able to see all processors reported by TORQUE. For example, if TORQUE is reporting two nodes with two processors each, make sure Moab detects all processors. This can be accomplished by issuing the command checknode -v <node_name>. If the "classes" value only reports one processor, restart Moab and check the results again.
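
The checks above can be scripted. The sketch below is one possible way to do that in Python, assuming the Moab commands named in this section (showq, checkjob, checknode) are on the PATH of the user running it. The helper names build_command and run are hypothetical, and the output is returned unparsed because its format is not described here.

```python
import shutil
import subprocess


def build_command(tool, target):
    """Build the argument list for `checkjob -v <job_id>` or `checknode -v <node_name>`."""
    if tool not in ("checkjob", "checknode"):
        raise ValueError(f"unsupported tool: {tool}")
    return [tool, "-v", str(target)]


def run(argv):
    """Run a command and return its standard output as text.

    Returns None when the command is not installed or not on PATH.
    """
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout
```

For example, run(["showq"]) returns the queue listing, and run(build_command("checkjob", "<job_id>")) returns the verbose report for one job, with <job_id> replaced by a real job ID.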

Logging

The logs found at /var/log/mcb/ may help diagnose a problem that occurs while using the wizard. The fatal file may contain information about what caused the error. The actions file records the standard error messages reported by each action during configuration. The copy_errors file reports any failures to copy the installation files from the installation source onto the head node. The mcb.log file(s) contain logging information from the graphical tool itself.
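
To see at a glance which of these files exist and contain anything, a short Python sketch such as the following may help. It is not part of the wizard. The function name is hypothetical, and the pattern mcb.log* is an assumption about how multiple mcb.log files are named.

```python
from pathlib import Path

LOG_DIR = Path("/var/log/mcb")
NAMED_LOGS = ("fatal", "actions", "copy_errors")


def summarize_logs(log_dir=LOG_DIR):
    """Return {file name: size in bytes} for the wizard logs that exist."""
    log_dir = Path(log_dir)
    candidates = [log_dir / name for name in NAMED_LOGS]
    # Assumption: additional mcb.log files share the "mcb.log" prefix.
    candidates += sorted(log_dir.glob("mcb.log*"))
    return {path.name: path.stat().st_size for path in candidates if path.is_file()}
```

A non-empty fatal or copy_errors entry in the result is a reasonable place to start reading.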

Diagnostics Logging

Diagnostics logs are found in subdirectories of /var/mcb/log. Each subdirectory is named with a timestamp giving the date and time the tests were started. For example, the logs for diagnostics executed at 12:00 p.m. on February 1st, 2008 are in the directory 2008-02-01_12.00.00. Inside the directory is a file for each test run, named by its process number, test number, and test description. For example, if the environment test was run first and had a PID of 20000, the corresponding file would be 20000_00_environment.t.log. The output from each test is in the *_diagnose.pl.log file, which can tell you exactly why a particular test failed.
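
The naming scheme can be expressed as a small Python sketch, which may help when locating the logs for a particular run. The function names are hypothetical. The two-digit, zero-based test number is inferred from the single example above and is an assumption.

```python
from datetime import datetime


def diagnostics_dir_name(started):
    """Return the subdirectory name for a diagnostics run started at `started`."""
    return started.strftime("%Y-%m-%d_%H.%M.%S")


def diag_log_name(pid, test_number, description):
    """Return the per-test log file name.

    Assumption: the test number is zero-padded to two digits, as in the
    example 20000_00_environment.t.log.
    """
    return f"{pid}_{test_number:02d}_{description}.t.log"
```

With the values from the example, diagnostics_dir_name(datetime(2008, 2, 1, 12, 0, 0)) gives 2008-02-01_12.00.00 and diag_log_name(20000, 0, "environment") gives 20000_00_environment.t.log.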

Support

If this deployment troubleshooting section does not answer your question, please contact escalante-support@clusterresources.com. To better help you, please attach the output of the mcb_make_report utility.