Troubleshooting
ET/BWMGR Software and Appliance - Common Problems
The following are common problems, and their potential solutions, on systems running
ET/BWMGR:
WARNING: Outside interface not set. Limiting Disabled.
After your bridge is configured and you have selected your default interface,
you must tell the BWMGR which bridge member
is the "outside", meaning the one that is connected to your router or
uplink. Click on the interface name, either from the main ET/BWMGR status page,
or from the rule listing page, and then select "Edit Interface".
Check the "Outside" checkbox, and then "Submit".
Loop Messages on console and/or in /var/log/messages
A bridge configuration depends on each MAC address on your network being
accessible via only one port of the bridge. A LOOP occurs when any MAC
address can be reached on both sides of the bridge. This is not necessarily
a problem if you get one or two isolated messages - especially during testing
when you may be moving machines around or plugging them into different ports.
If you see a screen full of these messages, this means that two or more bridged
ports on the appliance are plugged into the same switch or hub. Specifically,
the message tells you that the MAC address was received on both of the listed
interfaces. Constant looping can either halt your system or make it painfully
slow, and must be resolved. It indicates a serious flaw in your network setup.
In a Bridge configuration, packets cannot pass or
you get a lot of errors.
This can be caused by a number of problems.
(If you know you are getting errors on 1 or more interfaces, go to #2)
1) First, check your configuration. Using "bwmgr showbridges",
verify that the 2 ports are both in the same bridge group. Make certain that
the 2 devices (one on one side of the bridge and one on the other) are both
on the same logical network. Make sure you have no rules defined on either
interface. Also verify that only the primary bridge interface (shown by showbridges)
has an IP address on the logical network that you are bridging. Typically
secondary interfaces will have no IP address assigned. You should be able
to access the machine from both sides of the device (using a device on the
same logical network, of course). First try pinging on the primary interface
wire, then on the other. If neither works, then you most likely have a logical
setup problem.
2) If that doesn't work, check for errors on the interfaces. Use "netstat
-i" in FreeBSD or "ifconfig interface" in LINUX. If you are
getting errors when you try to pass data you may have a wiring problem. Using
crossover cables direct to Cisco equipment is a known problem area, as Cisco devices
in general do not NWAY (i.e. negotiate links) correctly. If you are getting
errors, you can usually solve the problem by forcing the interface on the
switch and the ET/BWMGR system to the same setting. You can use ifconfig in
FreeBSD (see the man page for your interface driver for details on the command and options), and mii-tool
in LINUX. If possible, try to use the bwmgr box setting and force the switch
or router. If that doesn't work, try to force both. If you can't get that to
work, you can put a small switch in between, which will allow separate negotiation
by each device. We've found that a cheap switch can often solve the problem.
To set the interface in FreeBSD:
ifconfig fxp0 media 100baseTX mediaopt full-duplex
would set fxp0 interface to 100Mb/s, Full Duplex. See the fxp man page ('man
fxp' on the console) for a list of options.
To set the interface in LINUX:
mii-diag -F 100baseTx-FD eth0
would set eth0 to 100Mb/s Full Duplex
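For the "same logical network" check in step 1 above, a small Python sketch can confirm that two addresses share a subnet. It assumes you know each device's IP address and the netmask in use; the function name and the addresses shown are placeholders for illustration, not values from any real setup.

```python
import ipaddress


def same_logical_network(ip_a: str, ip_b: str, netmask: str) -> bool:
    """Return True if both addresses fall in the same subnet for this netmask."""
    # strict=False lets us pass a host address and get its enclosing network.
    net_a = ipaddress.ip_network(f"{ip_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{ip_b}/{netmask}", strict=False)
    return net_a == net_b


if __name__ == "__main__":
    # Two hosts on the same /24: bridging between them should work.
    print(same_logical_network("192.0.2.10", "192.0.2.200", "255.255.255.0"))
    # Hosts on different networks: this is a logical setup problem, not a bridge fault.
    print(same_logical_network("192.0.2.10", "198.51.100.7", "255.255.255.0"))
```

If the function returns False for the devices on either side of the bridge, fix the addressing before looking for interface errors.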
Bridge won't pass packets - System/Appliance
If you do NOT have a Failover-equipped appliance, one possibility is that
the secondary port is not connected: The primary ethernet port (eth0/fxp0)
is part of the motherboard and cannot come loose. The secondary port(s) are
located in PCI expansion card slots internally, and there is a small chance
that they may move enough during shipping to move out of the slot. If this
happens, typically the interface will not be shown in the system at all. From
the command line, you can issue the command "bwmgr showbridges".
If only eth0/fxp0 is listed, this is your likely culprit. You can also check
with the "ifconfig" command from the command line. For example:
ifconfig eth1 (on a LINUX system) or
ifconfig fxp1 (on a FreeBSD system) or
ifconfig dc0 (on a FreeBSD 5-port system)
If the system indicates that the device cannot be found, then the second
port (the ethernet card in the box) is probably unseated. If you suspect that
a board has become unseated, you need to take off the cover and reseat the
board. If you do so, make certain that you contact Emerging Technologies
support beforehand; otherwise you may void your warranty.
Note that this procedure is only required when the port cannot be found; if
the port is shown via ifconfig and the ET/ADMIN (Networking->Network Configuration->Configure
Interfaces), reseating the card should not be necessary.
If the interface is present, see the bridge troubleshooting steps above.
ET/BWMGR is Limiting Too Much
If you have a limit set to (for example) 256000, and you can't get a local
application to use that much, these are the likely causes. First, you could
be losing packets. Check your interface for errors and look for drops on the
rule. Second, you could have a TCP window problem. Try different settings for
tcpwindow to keep the window from being set too low. Try 5000 to start. A setting
of 64000 effectively disables window shaping. If you
are still limited too much and you are running LINUX with a standalone license,
make sure you build a kernel on that machine. There have been reports that
running ET-generated kernels on AMD Athlon CPUs results in some timing errors.
Rebuilding a custom kernel on the machine seems to fix the problem.
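To see why a small TCP window can hold an application below its limit, recall that a TCP sender can have at most one window of data in flight per round trip. The Python sketch below works out that ceiling. The round-trip times are illustrative assumptions, as is the reading of the 256000 limit as bits per second; the function name is hypothetical.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_ms: int) -> float:
    """Upper bound on one TCP connection's throughput, in bits per second.

    At most one full window can be delivered per round trip, so the
    ceiling is window size divided by round-trip time.
    """
    return window_bytes * 8 * 1000 / rtt_ms


if __name__ == "__main__":
    # A 5000-byte window over a 200 ms path cannot reach a 256000 bit/s limit.
    print(max_tcp_throughput_bps(5000, 200))   # 200000.0
    # A 64000-byte window over a 100 ms path is far above that limit.
    print(max_tcp_throughput_bps(64000, 100))  # 5120000.0
```

If the ceiling for your window and round-trip time falls below the configured limit, the window, and not the rule, is what is restricting the connection.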
Registration Problems
"Server Down" message
If you get this message, it means that your system did not get a response
from our server. Registration requires a "handshake" on a UDP
port in the 4000-5000 range. Usually this occurs because the return
message from our server was blocked by a firewall. If support tells you
that your registration request was received by our server, then the problem
is on your end.
You can use tcpdump to try to debug the connection. On a separate screen,
run tcpdump on the interface with your IP address. Suppose it's on bge0:
tcpdump -n -i bge0 host bwserver.etinc.com and udp
Now run the register command from the command line and you should see a packet to
our server and a reply. You should see output similar to the following:
08:22:25.686529 92.237.221.228.58728 > 207.252.75.245.4552: udp 217
08:22:25.734282 207.252.75.245.4552 > 92.237.221.228.58728: udp 30 [tos 0x20]
If you don't get a response, it's likely because an upstream device is blocking the
transaction, and you'll need to fix it. If you see a response, then our server is
working. Check the addresses to make sure the response comes
back from the same address and not an alias. If you're not sure what you're seeing, cut and
paste the output from tcpdump and post it on a support ticket.
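The address check described above can be sketched in Python. This is only an illustration: it assumes tcpdump prints lines in exactly the format of the sample output above (newer tcpdump versions print extra fields), and the function names are hypothetical.

```python
import re

# Matches lines like:
#   08:22:25.686529 92.237.221.228.58728 > 207.252.75.245.4552: udp 217
LINE_RE = re.compile(
    r"^\S+\s+(\d+\.\d+\.\d+\.\d+)\.(\d+)\s+>\s+(\d+\.\d+\.\d+\.\d+)\.(\d+):\s+udp"
)


def parse_udp_line(line):
    """Return (src_ip, src_port, dst_ip, dst_port), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    src, sport, dst, dport = match.groups()
    return (src, int(sport), dst, int(dport))


def reply_matches(request_line, reply_line):
    """True if the reply comes from exactly the address and port the request went to."""
    request = parse_udp_line(request_line)
    reply = parse_udp_line(reply_line)
    if request is None or reply is None:
        return False
    src, sport, dst, dport = request
    return reply == (dst, dport, src, sport)


if __name__ == "__main__":
    sent = "08:22:25.686529 92.237.221.228.58728 > 207.252.75.245.4552: udp 217"
    back = "08:22:25.734282 207.252.75.245.4552 > 92.237.221.228.58728: udp 30 [tos 0x20]"
    print(reply_matches(sent, back))
```

A False result for a reply that did arrive suggests the response came back from an alias or a different port.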
"Invalid Key" message
This usually means that you are trying to register a key that can't be
registered, such as a 30 day test key.
"Invalid Argument" message
This usually means that you are trying to register or start a key
on an invalid interface, such as bw0. Make sure that the serial
number of the interface matches the key you are trying to register or
start.
mbuf clusters exhausted error message
This message is FreeBSD specific. If you get an error indicating that mbufs
are exhausted, it basically means that your system has run out of system memory
and there is no memory available to receive new packets. This can occur when
your system is receiving packets faster than it can process them for an extended
period of time. This can be caused by an attack, or it may just mean that
your system doesn't have enough memory allocated to the kernel to fulfill the
settings. mbuf clusters are allocated on the fly, so just because you've allocated
20K buffers doesn't mean that there is enough memory to actually use that
many. It's difficult to tell exactly how much kernel memory is in use or how
much is left, so the best you can do is try to increase your kernel allocation.
Our appliances use an algorithm to decide how much memory to allocate based
on the amount of ram in the system and the most common requirements. You can
manually override this setting by placing a setting in the /boot/loader.conf
config file.
First, determine how much memory you have
Before you can decide how much to allocate, you need to know how much you
currently have, and how much is currently allocated. This info is displayed
(and saved in your /var/log/messages file) on system boot. You can snarf
it out with the following commands:
# grep Using /var/log/messages | tail
From this command, you should see a line like:
Kernel Using 150000000 bytes
Now do:
# grep "real memory" /var/log/messages | tail
And you should get something like:
real memory = 528482304 (516096K bytes)
This tells you that you have about 512M (not K) of RAM in the system; 528482304 bytes is 504M.
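As a rough aid, the two boot messages can be turned into megabytes with a few lines of Python. This sketch assumes the log lines look exactly like the samples above; the function names are hypothetical.

```python
import re


def kernel_bytes(line):
    """Extract the byte count from a 'Kernel Using N bytes' line, or None."""
    match = re.search(r"Kernel Using (\d+) bytes", line)
    return int(match.group(1)) if match else None


def real_memory_bytes(line):
    """Extract the byte count from a 'real memory = N (...)' line, or None."""
    match = re.search(r"real memory\s*=\s*(\d+)", line)
    return int(match.group(1)) if match else None


def to_mb(byte_count):
    """Convert bytes to megabytes (1M = 1024 * 1024 bytes)."""
    return byte_count / (1024 * 1024)


if __name__ == "__main__":
    kernel = kernel_bytes("Kernel Using 150000000 bytes")
    total = real_memory_bytes("real memory  = 528482304 (516096K bytes)")
    print(f"kernel allocation: {to_mb(kernel):.0f}M of {to_mb(total):.0f}M")
    print(f"share of RAM given to the kernel: {kernel / total:.0%}")
```

With the sample values, the kernel allocation is well under a third of physical memory, which is why there is room to raise it.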
Next, set the kernel RAM allocation
Typically, you only have 1 user on a bandwidth management appliance at
a time, so you don't need a lot of user memory unless you're also using
the system as a server or if you're running squid. So on this system, we
can set the kernel allocation much higher. To set it to 300M (not K), we
can put the following line into /boot/loader.conf:
kern.vm.kmem.size="300000000"
Then you'll need to reboot the system. You've just doubled the amount of
memory available to the bandwidth management application.
What if you still get the "mbuf clusters exhausted" message?
If you still get the message, make sure you have your bandwidth manager
"maxbuffers" setting several hundred buffers below your clusters
setting. You can get your clusters setting with the following command:
sysctl -a | grep clusters
which should yield something like:
kern.ipc.nmbclusters: 20000
At the time of this writing, 20000 was the default setting. You should
have your maxbuffers set to 15000 or so. If this is the case, and you still
get the error message, we recommend that you increase your clusters by 20%.
You can do this by inserting a line in your /boot/loader.conf:
kern.ipc.nmbclusters="24000"
then reboot the system.
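The 20% increase can be computed from the sysctl output rather than by hand. The Python sketch below assumes the sysctl line has the form shown above; the function names are hypothetical, and the script only prints the suggested line without editing /boot/loader.conf.

```python
def parse_nmbclusters(sysctl_line: str) -> int:
    """Read the value from a 'kern.ipc.nmbclusters: N' line."""
    name, _, value = sysctl_line.partition(":")
    if name.strip() != "kern.ipc.nmbclusters":
        raise ValueError(f"unexpected sysctl line: {sysctl_line!r}")
    return int(value)


def increased_clusters(current: int, percent: int = 20) -> int:
    """Return the cluster count raised by the given percentage."""
    return current + current * percent // 100


def loader_conf_line(clusters: int) -> str:
    """Format the setting the way it appears in /boot/loader.conf."""
    return f'kern.ipc.nmbclusters="{clusters}"'


if __name__ == "__main__":
    current = parse_nmbclusters("kern.ipc.nmbclusters: 20000")
    print(loader_conf_line(increased_clusters(current)))
```

For the default of 20000 this prints the same kern.ipc.nmbclusters="24000" line given above.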
Graph Problems
ET/BWMGR v3.33c and older
If you have a problem viewing graphs in version 3.33c or below, there is likely some problem interfacing
to your database or a configuration problem. If the graph is not created or you get a broken graphic
symbol on your browser, then you should check your HTTP_ROOT and default graph
directory in your ET/BWMGR defaults settings. You can also view the HTML and check the URL that the system is building
from the info in your defaults.
ET/BWMGR version 4.0 and newer
Version 4.0 uses a new graph package, so there are additional things that you may have to check. If you have a broken icon, you can right-click on the icon and "open image in new window". This should print out any error messages. Usually the error is fairly self-explanatory. If you get a "Server cannot be found" message, then you have a problem with your web server. First, check to see if apache is running on your system. You can just start it, and it will complain if it's already running:
# apachectl start
If apache is running and you still can't access the server, then you'll have to debug the connection. Check the address in the browser window, and check your httpd config file, which is /usr/local/www/conf/httpd.conf. By default, your document root should be /usr/local/www/et. If you find that you need to make changes to your config file, you'll need to restart apache with the following command:
# apachectl restart
Problems with bwmgrd on all versions
If you have empty graphs and you are getting hits on the rules that are supposed to be graphed,
then the data is not being put into the database for one reason or another. Things to check:
- On the main GUI page, bwmgrd status is shown. This needs to be "Running".
- If bwmgrd is not running, go to Administration->System and Server Status, click on Bwmgrd Stats Daemon, and check the current status for the reason. The status only shows the last reported error. For more extensive analysis, you'll
need to check /var/log/bwmgrd.log for additional errors. From the console:
# tail -25 /var/log/bwmgrd.log
will show the last 25 lines in the log. If you have a lot of errors when bwmgrd
is started, check your settings using the "edit defaults" button from the GUI.
- When bwmgrd starts it prints a startup message in the log. So look for the last start and look at messages after that. Sometimes the error message will tell you what's wrong right away. Some common messages:
Cannot Open MySQL Database
If the database can't be opened, then either MySQL isn't running, or you have a problem with your permissions. You can check MySQL in System and Server Status, the same way you checked whether bwmgrd was running. If MySQL is running, check the database settings and password in your defaults.
Can't insert system info: Duplicate entry '1226272680' for key 'PRIMARY'
A duplicate entry message usually means that the same time has occurred twice. This can happen when you set the clock back on your system. This is an issue at daylight saving time changes, but there is no workaround. If you only get this once (or can trace it to a change of the system clock), then you can ignore it. If you get duplicate errors constantly, you may have multiple instances of bwmgrd running.
Failed to Add Data Record (mail): Table './etbwmgr/bwdata' is marked as crashed and should be repaired
If you are getting errors like the above, you likely need to repair your database. The procedure is described below.
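The three messages above can be matched mechanically when scanning a long log. The Python sketch below is illustrative only: the hint text paraphrases the explanations above, the function names are hypothetical, and treating the duplicate key as a Unix epoch timestamp is an assumption based on the "same time has occurred twice" explanation.

```python
import re
from datetime import datetime, timezone

# Fragment of the log message (lower case) -> suggested next step.
HINTS = (
    ("cannot open mysql database",
     "Check that MySQL is running, then check the database settings and password."),
    ("duplicate entry",
     "Ignore if it follows a clock change; otherwise look for multiple bwmgrd instances."),
    ("is marked as crashed",
     "Repair the database."),
)


def hint_for(line):
    """Return the suggested next step for a known message, or None."""
    lowered = line.lower()
    for fragment, hint in HINTS:
        if fragment in lowered:
            return hint
    return None


def duplicate_key_time(line):
    """Decode the duplicate key as a UTC time, assuming it is a Unix timestamp."""
    match = re.search(r"Duplicate entry '(\d+)'", line)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


if __name__ == "__main__":
    line = "Can't insert system info: Duplicate entry '1226272680' for key 'PRIMARY'"
    print(hint_for(line))
    print(duplicate_key_time(line))
```

Comparing the decoded time with the time of a known clock change helps decide whether a single duplicate error can be ignored.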
MySQL Problems and Database Repair
Before repairing the MySQL database, it's good to have an idea of what
the problem is. If you have already checked /var/log/bwmgrd.log, then
you may already have identified the error. If not, you should look at
/usr/local/var/mysql/HOSTNAME.err (HOSTNAME should be replaced with the
hostname you have assigned your appliance.)
On appliances, there is a command-line utility that will attempt to
automatically repair the database. You must be the super-user to run this
utility.
# fixdb
The 'fixdb' command will shut down bwmgrd and the MySQL database, and attempt
to repair your database tables. mysqld will then be restarted. This can be a slow
process, especially with large databases. When the process is complete, if
the repair was successful you should see the following line:
"Starting mysqld daemon with databases from /usr/local/var/mysql"
If you see this line and no further error messages, you can then restart
bwmgrd.
# /usr/local/sbin/bwmgrd
If you continue to have database problems after running 'fixdb', then use
the manual method below:
First, change your directory to the location of the 'etbwmgr' database files,
and list the files.
# cd /usr/local/var/mysql/etbwmgr
# ls -la
drwx------ 2 mysql mysql 512 Oct 11 14:55 .
drwx------ 4 mysql mysql 512 Oct 14 11:24 ..
-rw-rw---- 1 mysql mysql 107696 Oct 14 14:00 bwdata.MYD
-rw-rw---- 1 mysql mysql 24576 Oct 14 14:00 bwdata.MYI
-rw-rw---- 1 mysql mysql 9042 Oct 11 14:55 bwdata.frm
-rw-rw---- 1 mysql mysql 67 Oct 14 14:00 markers.MYD
-rw-rw---- 1 mysql mysql 2048 Oct 14 11:25 markers.MYI
-rw-rw---- 1 mysql mysql 8710 Oct 11 14:55 markers.frm
You should see a listing similar to the above, although the file sizes will
be different. If you do not have the same files, and instead see "bwdata.ISD"
and "bwdata.ISM", then you must run "isamchk" wherever the examples
below use "myisamchk".
The next step is to check your database for errors. Below is the output from
an uncorrupted database.
# myisamchk bwdata
Checking MyISAM file: bwdata
Data records: 849 Deleted blocks: 0
- check file-size
- check key delete-chain
- check record delete-chain
- check index reference
- check data record references index: 1
- check data record references index: 2
- check data record references index: 3
Please note that if you see the
following lines, this does NOT indicate a serious database corruption.
myisamchk: warning: 1 clients is using or hasn't closed the table properly
MyISAM-table 'bwdata' is usable but should be fixed
The key lines to look for in order to determine whether a repair is needed
are the last two lines of output:
MyISAM-table 'bwdata' is corrupted
Fix it using switch "-r" or "-o"
If you see errors listed, the next step is to attempt repair.
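The distinction between the harmless warning and real corruption can be expressed as a small check on the myisamchk output. This Python sketch only looks for the phrases quoted above; the function name and the three verdict strings are hypothetical.

```python
def myisamchk_verdict(output: str) -> str:
    """Classify myisamchk output as 'repair', 'warning', or 'ok'."""
    if "is corrupted" in output:
        # The table must be repaired before it can be trusted.
        return "repair"
    if "is usable but should be fixed" in output:
        # Typically a table that was not closed properly; not serious corruption.
        return "warning"
    return "ok"


if __name__ == "__main__":
    sample = (
        "myisamchk: warning: 1 clients is using or hasn't closed the table properly\n"
        "MyISAM-table 'bwdata' is usable but should be fixed\n"
    )
    print(myisamchk_verdict(sample))
```

Only a "repair" verdict calls for the shutdown and repair steps that follow.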
First, shut down bwmgrd and the MySQL server:
# killall bwmgrd
# mysqladmin -p -u root shutdown
(You will be prompted for the password to complete this step.)
Next, backup the /usr/local/var/mysql directory manually or using the "Backup"
feature of the ET/Admin.
Then, begin the repair operation. If this fails, it may not be possible to
recover any information, unless the failure yields more information about
the underlying problem.
# myisamchk -r bwdata
If you have an appliance or your MySQL distribution is built using /var as
the default directory, you may not have enough space in the partition to repair
your database. In this case, create a temp directory in your /usr partition
if you don't already have one, and specify it as the temp directory as follows:
# mkdir /usr/local/temp
# myisamchk --tmpdir=/usr/local/temp -r bwdata
If the repair is successful then you will be able to restart the MySQL server
and you are done. If the repair is not successful, your only reliable options
are to restore from your last backup or to re-create an empty database (You
do have backups, right?).
# /usr/local/bin/safe_mysqld --user=mysql &
(Restart the MySQL server.)
# /usr/local/sbin/bwmgrd (Restart bwmgrd.)
If you see big spikes in your Graph Data when rebooting
If you see big spikes in your graphs that correspond to a reboot, it probably
means that bwmgrd was started before mysqld, either because you started them
in the wrong order or because of timing issues with system threads. Make certain
that you start mysqld before bwmgrd, and that you allow at least 2 seconds in
between for mysqld to get its act together. You can do this with a "sleep",
or by running something else in-between.
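As an alternative to a fixed "sleep", a startup script could wait until mysqld actually accepts connections. The Python sketch below is one way to do that, not the method the appliance uses. It assumes mysqld listens on a TCP port on the local host (3306 is the MySQL default; a server configured for socket-only access would need a different check), and the function name is hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the server is accepting connections.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A startup script would call wait_for_port("127.0.0.1", 3306) after launching mysqld and start bwmgrd only if it returns True.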
You'll need to manually remove the "spikes" from the database with
an SQL DELETE. Figure out approximately what the data size is (from the graphs,
noting a sample duration of 300 seconds) and look for data that far exceeds a normal
reading for the graph. So, for example, if the normal high reading is 120 kbit/s
for incoming data, that equates to a "bytes_in" value of about 4.5
million bytes (120,000 / 8 * 300). You could search the database on the given
date for values over 8 million, and you should be able to locate your spike
data. Just delete the row, as the reading is invalid.
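The arithmetic above generalizes to any rate and sample interval. This Python sketch reproduces it; the function names are hypothetical, and the 300-second default comes from the sample duration mentioned above.

```python
def bytes_per_sample(rate_bps: int, interval_seconds: int = 300) -> int:
    """Bytes transferred in one sample at a steady rate given in bits per second."""
    return rate_bps // 8 * interval_seconds


def implied_rate_bps(byte_count: int, interval_seconds: int = 300) -> float:
    """Average rate in bits per second implied by a stored byte count."""
    return byte_count * 8 / interval_seconds


if __name__ == "__main__":
    # A normal high of 120 kbit/s corresponds to about 4.5 million bytes per sample.
    print(bytes_per_sample(120_000))             # 4500000
    # A stored value of 8 million bytes would imply roughly 213 kbit/s.
    print(round(implied_rate_bps(8_000_000)))    # 213333
```

Any row whose implied rate is far above what the link normally carries is a candidate spike.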
"Can't Get Statistics" Error Messages
If you see "can't get statistics for rulename" messages
in your /var/log/bwmgrd.log file, it means that a rule was deleted (or failed
on startup) that the statistical system still thinks should be there. When you
delete a rule gracefully from the GUI, the marker file that bwmgrd looks for
is removed. If the rule was removed purposefully, you can get rid of this message
by deleting the associated file in the /usr/local/etc/bwmgr/config directory.
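To find out which rules the statistics system is still looking for, the log can be scanned for these messages. The Python sketch below assumes each message contains the phrase "can't get statistics for" followed by a rule name without spaces; the exact log format, the function name, and the rule names in the example are assumptions.

```python
import re

PATTERN = re.compile(r"can't get statistics for (\S+)", re.IGNORECASE)


def stale_rule_names(log_lines):
    """Return the sorted, de-duplicated rule names mentioned in the messages."""
    names = set()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


if __name__ == "__main__":
    # Hypothetical log lines for illustration only.
    sample = [
        "can't get statistics for mail",
        "can't get statistics for web",
        "can't get statistics for mail",
    ]
    print(stale_rule_names(sample))
```

On a real system the lines would come from reading /var/log/bwmgrd.log; each name returned is a rule to check before removing anything from the config directory.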
What to do if your system/appliance doesn't power up properly
If the unit is completely unresponsive (i.e., no fan noise, nothing on the screen,
no beeping), check all power connections, as well as all switches. The ET/R1500
series have a main power switch on the power supply as well as a "power-on"
switch on the front panel. For users with the 2U enclosure option, make sure
that you have the correct voltage (see
power supply requirements). If the outlet and power cord test good (test
on a monitor or other appliance with a standard AC input), and there's still
absolutely no response from the unit, contact Emerging Technologies for technical
support or RMA service.
If the unit powers up, but freezes before the OS boots, then it's possible that
the CPU fan/heatsink has popped off its mount during shipping. Please make a
note of what's on the monitor, then power-off the machine and contact Emerging
Technologies' technical support.
(LINUX Only) If the boot stops at the message "Starting system logger:",
this is likely due to an incorrect or missing DNS setup. You must wait for the
current program (syslogd) to timeout while trying to get the hostname. This
may take up to 3 minutes, so be patient. Once the machine has booted, make
sure DNS is enabled and set up properly.
If the unit displays the power-on self-test (POST), but does not find a bootable
device, it's possible (although very unlikely) that the IDE cable has come loose
from either the hard drive or the motherboard. Please notify Emerging Technologies'
support staff before opening the box!
If the monitor remains blank, but the fans start and you hear a series of beeps
from the unit, this indicates a problem with the memory. The ET/R1500 units
use standard DIMM RAM modules. Either the module has come loose from its seating,
or has failed completely. Contact Emerging Technologies' support staff before
opening the box and attempting to re-seat the RAM. If re-seating the RAM does
not work, we will likely issue an RMA.
Disaster Recovery
This section deals with a situation wherein your appliance does not boot, either
due to a crash that fsck (the UNIX "chkdsk" or "scandisk"
equivalent) cannot deal with gracefully, or a panic during the boot process.
In either case, you can either use the ET/Recovery CD to fix the problem, or
take manual control of the appliance at boot time.
If you do not have a Recovery CD, then you must follow the step-by-step instructions
below. If you do have a CD, boot the appliance with the CD-ROM in the drive,
and use the "fix" command to repair and mount the appliance filesystems
automatically. If you are experiencing a panic, you can make the necessary changes
after running "fix", since the appliance filesystem will be accessible
in the "/mnt" directory. See the ET/Recovery Manual for more information.
Manual Instructions:
FreeBSD
Hit "F1" at the boot menu to select FreeBSD. After a few seconds,
you will see the text "kernel= " as the
kernel is loaded, followed by a 3-second countdown. Press the spacebar (or any
key besides enter) to interrupt the boot. You will then see a "boot>"
prompt. Enter the following command to boot into single-user mode:
boot> boot -s
Alternatively, if you are loading a debug kernel, you must instead do this:
boot> unload
boot> load kernel.dbg
boot> boot
This will load the debug kernel for a single boot.
You will be prompted to enter the shell for root, if you are entering single-user
mode. Simply hit "enter" to accept the default of /bin/sh. Now you
should have a root prompt - key in the following series of commands:
# /sbin/fsck -y /
# /sbin/fsck -y /var
# /sbin/fsck -y /usr
This last command should take a few minutes to complete, at which time you
can either continue the boot, or you can make appropriate changes to your startup files.
If you need to make any changes, you must first enable read/write access to your filesystems:
# mount -a
If you know exactly what is causing the problem, then you can take specific action to fix it.
If you suspect a BWMGR rule is causing problems, but don't know which one, then you can bypass starting the ET/BWMGR like this:
# mv /etc/rc.bwmgr /etc/rc.bwmgr.sav
# mv /etc/rc.bridge /etc/rc.bridge.save
# exit
LINUX
Hit "F2" at the boot menu to select Linux. Next will appear the "LILO:"
prompt. Type " linux -s " at the prompt and press enter.
You may have to enter the root password to get a shell prompt. At the prompt,
type the following commands:
# /sbin/fsck -y /
# /sbin/fsck -y /usr
This last command should take a few minutes to complete, at which time you
can either continue the boot, or you can make appropriate changes to your startup files.
If you need to make any changes, you must first enable read/write access to your filesystems:
# mount -a
If you know exactly what is causing the problem, then you can take specific action to fix it.
If you suspect a BWMGR rule is causing problems, but don't know which one, then you can bypass starting the ET/BWMGR like this:
# mv /etc/rc.bwmgr /etc/rc.bwmgr.sav
# mv /etc/rc.bridge /etc/rc.bridge.save
# exit
Hopefully you will be able to boot after performing this procedure. If not,
please contact Emerging Technologies for technical assistance.