Deployment Troubleshooting
The Moab Cluster Deployment Wizard does not start up automatically
after the head node installation.
- Click the Moab Cluster Deployment Wizard icon on the desktop to
start the wizard. If the icon does not exist, the head node
installation did not complete successfully.
- Try reinstalling the head node.
- If reinstalling the head node does not work, run the wizard manually
with the command found at /usr/share/mcb/mcb.
I accidentally configured the incorrect number of compute nodes and/or
racks in the Moab Cluster Deployment Wizard.
- Run the wizard again, specifying the correct numbers. The wizard will
reconfigure your cluster.
The Moab Cluster Deployment Wizard's advanced layout section is disabled.
- Three values are required to enable the advanced layout section:
(1) the network interface, (2) the number of racks, and (3) the number of
nodes in each rack.
The Moab Cluster Deployment Wizard's cluster configuration fails.
- The /var/log/mcb/fatal file may give information that explains
the problem.
- If one or more services did not start during configuration, look at
the logs found at /var/log/mcb/services_logs.
- See the Logging section for more details.
The Moab Cluster Deployment Wizard's compute node booting fails.
- Check to make sure network cables are plugged in.
- Check your firewall settings and atftpd configuration. Refer to the
SUSE manual for more information on how to do this.
- Make sure that the compute node is set to boot from network (PXE)
in BIOS, or that you are manually overriding the boot order to boot
from the network.
- Make sure the head node network interface selected in the wizard is the
one that actually communicates with the compute nodes. For example, booting
fails if the head node's 'eth1' interface is connected to the compute
nodes but 'eth0' was configured in the wizard.
- If you cannot use PXE for any reason, there is a tool called Etherboot/gPXE
that will allow you to boot from a CD/floppy and then load the network
booter. This tool can be found at http://www.etherboot.org.
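
As a first step toward the interface check described above, the following minimal Python sketch lists the network interfaces present on the head node and reports whether a given name exists. The name "eth0" in the example is only a placeholder for whatever was entered in the wizard. The sketch confirms that the interface exists; it does not confirm that the interface is cabled to the compute nodes.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces known to the OS."""
    # socket.if_nameindex() yields (index, name) pairs on Unix-like systems.
    return [name for _, name in socket.if_nameindex()]


def interface_exists(name):
    """Return True if an interface with this name is present."""
    return name in list_interfaces()


if __name__ == "__main__":
    configured = "eth0"  # placeholder: the interface chosen in the wizard
    print("Interfaces on this host:", ", ".join(list_interfaces()))
    if interface_exists(configured):
        print(configured, "exists on this host")
    else:
        print(configured, "was not found; check the wizard setting")
```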
The Moab Cluster Deployment Wizard's diagnostics fail.
- Make sure the scheduler's queue(s) are empty. For the tests
to succeed, jobs cannot be running at test time.
- After the diagnostics have finished, check Moab's queue(s) with the
showq command. If any jobs remain in the queue(s), find out why they
failed or are still running by issuing the command
checkjob -v <job_id>.
- Make sure that Moab is able to see all processors reported by TORQUE.
For example, if TORQUE is reporting two nodes with two processors
each, make sure Moab detects all processors. This can be accomplished
by issuing the command checknode -v <node_name>.
If the "classes" value only reports one processor, restart Moab
and check the results again.
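
The three commands named above (showq, checkjob -v, checknode -v) can be collected into one small helper. The following is a sketch only: the job ID and node name are hypothetical placeholders, and the helper simply returns the raw command output without interpreting it. If a command is not on the PATH, it returns None instead of failing.

```python
import shutil
import subprocess


def checkjob_command(job_id):
    """Build the argument list for 'checkjob -v <job_id>'."""
    return ["checkjob", "-v", str(job_id)]


def checknode_command(node_name):
    """Build the argument list for 'checknode -v <node_name>'."""
    return ["checknode", "-v", str(node_name)]


def run_if_available(command):
    """Run a command and return its stdout, or None if it is not on PATH."""
    if shutil.which(command[0]) is None:
        return None
    completed = subprocess.run(
        command, capture_output=True, text=True, check=False
    )
    return completed.stdout


if __name__ == "__main__":
    # "1234" and "node01" are placeholders; substitute real values
    # taken from the showq output and from your cluster layout.
    for cmd in (["showq"], checkjob_command("1234"), checknode_command("node01")):
        output = run_if_available(cmd)
        print("$", " ".join(cmd))
        print(output if output is not None else "(command not found on PATH)")
```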
Logging
The logs found in /var/log/mcb/ may help diagnose a problem
that occurs while using the wizard. The fatal file may contain
information about what caused the error. The actions file records
the standard error messages reported by each action during the configuration.
The copy_errors file reports any failures to copy the installation
files from the installation source onto the head node. The mcb.log
file(s) contain logging information from the graphical tool itself.
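
To look at all of these files at once, a short script along the following lines may help. It is a sketch that assumes the file names given above and treats any file whose name begins with mcb.log as one of the graphical tool's logs. The function name and the default of 20 lines are illustrative choices.

```python
from pathlib import Path

# File names taken from the description above; "mcb.log*" also matches
# any additional mcb.log files that may be present.
LOG_PATTERNS = ("fatal", "actions", "copy_errors", "mcb.log*")


def tail_wizard_logs(log_dir="/var/log/mcb", lines=20):
    """Return {file name: last `lines` lines} for each wizard log present."""
    root = Path(log_dir)
    result = {}
    for pattern in LOG_PATTERNS:
        for path in sorted(root.glob(pattern)):
            if path.is_file():
                text = path.read_text(errors="replace")
                result[path.name] = text.splitlines()[-lines:]
    return result


if __name__ == "__main__":
    for name, tail in tail_wizard_logs().items():
        print(f"==> {name} <==")
        print("\n".join(tail))
```

Files that do not exist are skipped, so the script can be run even when the configuration failed early.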
Diagnostics Logging
Diagnostics logs are found in subdirectories of /var/mcb/log.
Each subdirectory is named with a timestamp that gives the date and
time the tests were started. For example, the logs for diagnostics
executed at 12:00 p.m. on February 1st, 2008 would be located in the
directory 2008-02-01_12.00.00.
Inside the directory is a file for each test that was run, named by its
process number (PID), test number, and test description. For example, if
the environment test was run first and had a PID of 20000,
the corresponding file would be 20000_00_environment.t.log.
The file with the output from each test is the *_diagnose.pl.log file. This file can tell you exactly why a particular test failed.
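
The naming scheme can be reproduced in a few lines of Python, which is handy when searching for the logs of a particular run. This sketch is inferred solely from the two examples above. In particular, the two-digit zero padding of the test number is an assumption based on the "00" in the example file name.

```python
from datetime import datetime


def diagnostics_dir_name(started):
    """Return the subdirectory name for a diagnostics run started at `started`."""
    # Pattern inferred from the example 2008-02-01_12.00.00.
    return started.strftime("%Y-%m-%d_%H.%M.%S")


def log_file_name(pid, test_number, description):
    """Return the per-test log file name."""
    # Assumption: the test number is zero-padded to two digits.
    return f"{pid}_{test_number:02d}_{description}.t.log"


if __name__ == "__main__":
    print(diagnostics_dir_name(datetime(2008, 2, 1, 12, 0, 0)))
    print(log_file_name(20000, 0, "environment"))
```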
Support
If this deployment troubleshooting section does not answer your question, please contact escalante-support@clusterresources.com. To better
help you, please attach the output of the mcb_make_report utility.