FortiGate HA-Cluster Troubleshooting using Checksums
Comparing checksums of cluster units
You can use the diagnose sys ha checksum show command to compare the configuration checksums of all cluster units. The output of this command shows checksums labeled global and all, as well as checksums for each of the VDOMs, including the root VDOM. The get system ha-nonsync-csum command can be used to display similar information; however, this command is intended to be used by FortiManager.
The primary unit and subordinate unit checksums should be the same. If they are not, you can use the execute ha synchronize start command to force a synchronization.
The following command output is for the primary unit of a cluster that does not have multiple VDOMs enabled:
diagnose sys ha checksum show
is_manage_master()=1, is_root_master()=1
debugzone
global: a0 7f a7 ff ac 00 d5 b6 82 37 cc 13 3e 0b 9b 77
root: 43 72 47 68 7b da 81 17 c8 f5 10 dd fd 6b e9 57
all: c5 90 ed 22 24 3e 96 06 44 35 b6 63 7c 84 88 d5
checksum
global: a0 7f a7 ff ac 00 d5 b6 82 37 cc 13 3e 0b 9b 77
root: 43 72 47 68 7b da 81 17 c8 f5 10 dd fd 6b e9 57
all: c5 90 ed 22 24 3e 96 06 44 35 b6 63 7c 84 88 d5
The following command output is for a subordinate unit of the same cluster:
diagnose sys ha checksum show
is_manage_master()=0, is_root_master()=0
debugzone
global: a0 7f a7 ff ac 00 d5 b6 82 37 cc 13 3e 0b 9b 77
root: 43 72 47 68 7b da 81 17 c8 f5 10 dd fd 6b e9 57
all: c5 90 ed 22 24 3e 96 06 44 35 b6 63 7c 84 88 d5
checksum
global: a0 7f a7 ff ac 00 d5 b6 82 37 cc 13 3e 0b 9b 77
root: 43 72 47 68 7b da 81 17 c8 f5 10 dd fd 6b e9 57
all: c5 90 ed 22 24 3e 96 06 44 35 b6 63 7c 84 88 d5
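The following example shows the same command on a cluster with multiple VDOMs enabled. Two VDOMs have been added, named test and Eng_vdm.
From the primary unit: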
config global
diagnose sys ha checksum show
is_manage_master()=1, is_root_master()=1
debugzone
global: 65 75 88 97 2d 58 1b bf 38 d3 3d 52 5b 0e 30 a9
test: a5 16 34 8c 7a 46 d6 a4 1e 1f c8 64 ec 1b 53 fe
root: 3c 12 45 98 69 f2 d8 08 24 cf 02 ea 71 57 a7 01
Eng_vdm: 64 51 7c 58 97 79 b1 b3 b3 ed 5c ec cd 07 74 09
all: 30 68 77 82 a1 5d 13 99 d1 42 a3 2f 9f b9 15 53
checksum
global: 65 75 88 97 2d 58 1b bf 38 d3 3d 52 5b 0e 30 a9
test: a5 16 34 8c 7a 46 d6 a4 1e 1f c8 64 ec 1b 53 fe
root: 3c 12 45 98 69 f2 d8 08 24 cf 02 ea 71 57 a7 01
Eng_vdm: 64 51 7c 58 97 79 b1 b3 b3 ed 5c ec cd 07 74 09
all: 30 68 77 82 a1 5d 13 99 d1 42 a3 2f 9f b9 15 53
From the subordinate unit:
config global
diagnose sys ha checksum show
is_manage_master()=0, is_root_master()=0
debugzone
global: 65 75 88 97 2d 58 1b bf 38 d3 3d 52 5b 0e 30 a9
test: a5 16 34 8c 7a 46 d6 a4 1e 1f c8 64 ec 1b 53 fe
root: 3c 12 45 98 69 f2 d8 08 24 cf 02 ea 71 57 a7 01
Eng_vdm: 64 51 7c 58 97 79 b1 b3 b3 ed 5c ec cd 07 74 09
all: 30 68 77 82 a1 5d 13 99 d1 42 a3 2f 9f b9 15 53
checksum
global: 65 75 88 97 2d 58 1b bf 38 d3 3d 52 5b 0e 30 a9
test: a5 16 34 8c 7a 46 d6 a4 1e 1f c8 64 ec 1b 53 fe
root: 3c 12 45 98 69 f2 d8 08 24 cf 02 ea 71 57 a7 01
Eng_vdm: 64 51 7c 58 97 79 b1 b3 b3 ed 5c ec cd 07 74 09
all: 30 68 77 82 a1 5d 13 99 d1 42 a3 2f 9f b9 15 53
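Comparing these outputs by eye becomes error-prone once several VDOMs are involved. The following Python sketch is one possible way to automate the comparison. It assumes the command output has been captured as text in exactly the format shown above; the function names parse_checksum_output and mismatched_scopes are illustrative and are not part of FortiOS.

```python
def parse_checksum_output(text):
    """Split captured output into {"debugzone": {...}, "checksum": {...}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line in ("debugzone", "checksum"):
            # Section header: the lines that follow belong to this section.
            current = sections.setdefault(line, {})
        elif current is not None and ":" in line:
            name, _, value = line.partition(":")
            # Normalize spacing so "a0  7f" and "a0 7f" compare equal.
            current[name.strip()] = " ".join(value.split())
    return sections


def mismatched_scopes(primary_text, subordinate_text, section="checksum"):
    """Return the names (global, a VDOM name, all) whose checksums differ."""
    primary = parse_checksum_output(primary_text).get(section, {})
    subordinate = parse_checksum_output(subordinate_text).get(section, {})
    names = list(primary) + [n for n in subordinate if n not in primary]
    return [n for n in names if primary.get(n) != subordinate.get(n)]


if __name__ == "__main__":
    # Checksums shortened for readability.
    primary = """is_manage_master()=1, is_root_master()=1
debugzone
global: a0 7f a7 ff
root: 43 72 47 68
all: c5 90 ed 22
checksum
global: a0 7f a7 ff
root: 43 72 47 68
all: c5 90 ed 22
"""
    # Hypothetical subordinate capture in which only the root VDOM differs.
    subordinate = primary.replace("root: 43 72 47 68", "root: 00 00 00 00")
    print(mismatched_scopes(primary, subordinate))  # prints ['root']
```

An empty list means the two units report the same checksums for every scope in the chosen section.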
How to diagnose HA out of sync messages
This section describes how to use the diagnose sys ha checksum show and diagnose debug commands to diagnose the cause of HA out of sync messages.
If HA synchronization is not successful, use the following procedures on each cluster unit to find the cause.
To determine why HA synchronization does not occur
- Connect to each cluster unit CLI by connecting to the console port.
- Enter the following commands to enable debugging and display HA out of sync messages.
diagnose debug enable
diagnose debug console timestamp enable
diagnose debug application hatalk -1
diagnose debug application hasync -1
- Collect the console output and compare the out of sync messages with the information in the table HA out of sync object messages and the configuration objects that they reference.
- Enter the following commands to turn off debugging.
diagnose debug disable
diagnose debug reset
To determine what part of the configuration is causing the problem
If the previous procedure displays messages that include sync object 0x03 (for example, HA_SYNC_SETTING_CONFIGURATION = 0x03), there is a synchronization problem with the configuration. Use the following steps to determine the part of the configuration that is causing the problem.
If your cluster consists of two cluster units, use this procedure to capture the configuration checksums for each unit. If your cluster consists of more than two cluster units, repeat this procedure for all cluster units that returned messages that include 0x03 sync object messages.
- Connect to each cluster unit CLI by connecting to the console port.
- Enter the following command to turn on terminal capture:
diagnose debug enable
- Enter the following command to stop HA synchronization.
execute ha sync stop
- Enter the following command to display configuration checksums.
diagnose sys ha checksum show global
- Copy the output to a text file.
- Repeat for all affected units.
- Compare the text file from the primary unit with the text file from each cluster unit to find the checksums that do not match.
You can use a diff function to compare text files.
- Repeat for the root VDOM:
diagnose sys ha checksum show root
- Repeat for all VDOMs (if multiple VDOM configuration is enabled):
diagnose sys ha checksum show <vdom-name>
- You can also use the grep option to display checksums for only part of the configuration. For example, to display system-related configuration checksums in the root VDOM or log-related checksums in the global configuration:
diagnose sys ha checksum show root | grep system
diagnose sys ha checksum show global | grep log
Generally it is the first non-matching checksum that is the cause of the synchronization problem.
- Attempt to remove/change the part of the configuration that is causing the problem. You can do this by making configuration changes from the primary unit or subordinate unit CLI.
- Enter the following commands to restart HA synchronization and stop debugging:
execute ha sync start
diagnose debug disable
diagnose debug reset
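The comparison step in this procedure relies on a diff of the saved text files. As a minimal sketch of that step, the Python below uses the standard difflib module and also reports the first non-matching line, since that is usually the one to investigate. It assumes every checksum line in the capture has the form "name: checksum", as in the sample output earlier in this document; the layout of the detailed per-object output may differ, and the function names are illustrative.

```python
import difflib
from itertools import zip_longest


def checksum_lines(captured_text):
    """Keep only 'name: checksum' lines, stripped of surrounding whitespace.

    Header lines such as is_manage_master()=1 contain no colon and are
    dropped, because they differ between units by design.
    """
    return [line.strip() for line in captured_text.splitlines() if ":" in line]


def first_mismatch(primary_text, other_text):
    """Return (position, primary_line, other_line) for the first difference, or None."""
    pairs = zip_longest(checksum_lines(primary_text), checksum_lines(other_text))
    for position, (ours, theirs) in enumerate(pairs, start=1):
        if ours != theirs:
            return position, ours, theirs
    return None


def checksum_diff(primary_text, other_text):
    """Return a unified diff of the two captures as a list of lines."""
    return list(difflib.unified_diff(
        checksum_lines(primary_text),
        checksum_lines(other_text),
        fromfile="primary",
        tofile="subordinate",
        lineterm="",
    ))
```

In use, each argument would be the contents of one saved capture file, for example read with pathlib.Path(...).read_text(). An empty diff and a first_mismatch result of None both indicate that the captures agree.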
Console messages when configuration synchronization fails
If you connect to the console of a subordinate unit that is out of synchronization with the primary unit, messages similar to the following are displayed.
slave is not in sync with master, sequence:0. (type 0x3)
slave is not in sync with master, sequence:1. (type 0x3)
slave is not in sync with master, sequence:2. (type 0x3)
slave is not in sync with master, sequence:3. (type 0x3)
slave is not in sync with master, sequence:4. (type 0x3)
global compared not matched
The messages include a type value (in this example, type 0x3). The type value can help Fortinet Support diagnose the synchronization problem.
HA out of sync object messages and the configuration objects that they reference
Out of Sync Message | Configuration Object |
---|---|
HA_SYNC_SETTING_CONFIGURATION = 0x03 | /data/config |
HA_SYNC_SETTING_AV = 0x10 | |
HA_SYNC_SETTING_VIR_DB = 0x11 | /etc/vir |
HA_SYNC_SETTING_SHARED_LIB = 0x12 | /data/lib/libav.so |
HA_SYNC_SETTING_SCAN_UNIT = 0x13 | /bin/scanunitd |
HA_SYNC_SETTING_IMAP_PRXY = 0x14 | /bin/imapd |
HA_SYNC_SETTING_SMTP_PRXY = 0x15 | /bin/smtp |
HA_SYNC_SETTING_POP3_PRXY = 0x16 | /bin/pop3 |
HA_SYNC_SETTING_HTTP_PRXY = 0x17 | /bin/thttp |
HA_SYNC_SETTING_FTP_PRXY = 0x18 | /bin/ftpd |
HA_SYNC_SETTING_FCNI = 0x19 | /etc/fcni.dat |
HA_SYNC_SETTING_FDNI = 0x1a | /etc/fdnservers.dat |
HA_SYNC_SETTING_FSCI = 0x1b | /etc/sci.dat |
HA_SYNC_SETTING_FSAE = 0x1c | /etc/fsae_adgrp.cache |
HA_SYNC_SETTING_IDS = 0x20 | /etc/ids.rules |
HA_SYNC_SETTING_IDSUSER_RULES = 0x21 | /etc/idsuser.rules |
HA_SYNC_SETTING_IDSCUSTOM = 0x22 | |
HA_SYNC_SETTING_IDS_MONITOR = 0x23 | /bin/ipsmonitor |
HA_SYNC_SETTING_IDS_SENSOR = 0x24 | /bin/ipsengine |
HA_SYNC_SETTING_NIDS_LIB = 0x25 | /data/lib/libips.so |
HA_SYNC_SETTING_WEBLISTS = 0x30 | |
HA_SYNC_SETTING_CONTENTFILTER = 0x31 | /data/cmdb/webfilter.bword |
HA_SYNC_SETTING_URLFILTER = 0x32 | /data/cmdb/webfilter.urlfilter |
HA_SYNC_SETTING_FTGD_OVRD = 0x33 | /data/cmdb/webfilter.fgtd-ovrd |
HA_SYNC_SETTING_FTGD_LRATING = 0x34 | /data/cmdb/webfilter.fgtd-ovrd |
HA_SYNC_SETTING_EMAILLISTS = 0x40 | |
HA_SYNC_SETTING_EMAILCONTENT = 0x41 | /data/cmdb/spamfilter.bword |
HA_SYNC_SETTING_EMAILBWLIST = 0x42 | /data/cmdb/spamfilter.emailbwl |
HA_SYNC_SETTING_IPBWL = 0x43 | /data/cmdb/spamfilter.ipbwl |
HA_SYNC_SETTING_MHEADER = 0x44 | /data/cmdb/spamfilter.mheader |
HA_SYNC_SETTING_RBL = 0x45 | /data/cmdb/spamfilter.rbl |
HA_SYNC_SETTING_CERT_CONF = 0x50 | /etc/cert/cert.conf |
HA_SYNC_SETTING_CERT_CA = 0x51 | /etc/cert/ca |
HA_SYNC_SETTING_CERT_LOCAL = 0x52 | /etc/cert/local |
HA_SYNC_SETTING_CERT_CRL = 0x53 | /etc/cert/crl |
HA_SYNC_SETTING_DB_VER = 0x55 | |
HA_GET_DETAIL_CSUM = 0x71 | |
HA_SYNC_CC_SIG = 0x75 | /etc/cc_sig.dat |
HA_SYNC_CC_OP = 0x76 | /etc/cc_op |
HA_SYNC_CC_MAIN = 0x77 | /etc/cc_main |
HA_SYNC_FTGD_CAT_LIST = 0x7a | /migadmin/webfilter/ublock/ftgd/data/ |
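When a console capture is long, looking up each type value by hand is tedious. The Python sketch below shows one way to pull the type values out of captured console text and look them up. It assumes the messages carry the "(type 0x3)" suffix shown in the console example; the SYNC_OBJECTS dictionary holds only a hand-picked subset of the table and would need to be extended, and the function name is illustrative.

```python
import re

# Subset of the table above, keyed by the numeric type value.
SYNC_OBJECTS = {
    0x03: ("HA_SYNC_SETTING_CONFIGURATION", "/data/config"),
    0x11: ("HA_SYNC_SETTING_VIR_DB", "/etc/vir"),
    0x20: ("HA_SYNC_SETTING_IDS", "/etc/ids.rules"),
    0x32: ("HA_SYNC_SETTING_URLFILTER", "/data/cmdb/webfilter.urlfilter"),
    0x52: ("HA_SYNC_SETTING_CERT_LOCAL", "/etc/cert/local"),
}

TYPE_PATTERN = re.compile(r"\(type (0x[0-9a-fA-F]+)\)")


def sync_objects_in(console_output):
    """Return (type_value, message_name, object_path) for each distinct type seen."""
    seen = []
    for match in TYPE_PATTERN.finditer(console_output):
        # int(..., 16) makes "0x3" and "0x03" refer to the same entry.
        code = int(match.group(1), 16)
        if code not in seen:
            seen.append(code)
    return [(code, *SYNC_OBJECTS.get(code, ("unknown", ""))) for code in seen]
```

For the console example in this section, every message carries type 0x3, so the function returns a single entry for HA_SYNC_SETTING_CONFIGURATION and /data/config.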
Synchronizing the configuration
The FGCP uses a combination of incremental and periodic synchronization to make sure that the configuration of all cluster units is synchronized to that of the primary unit.
The following settings are not synchronized between cluster units:
- HA override.
- HA device priority.
- The virtual cluster priority.
- The FortiGate host name.
- The HA priority setting for a ping server (or dead gateway detection) configuration.
- The system interface settings of the HA reserved management interface.
- The HA default route for the reserved management interface, set using the ha-mgmt-interface-gateway option of the config system ha command.
All synchronization activity takes place over the HA heartbeat link using TCP/703 and UDP/703 packets.
Recalculating the checksums to resolve out of sync messages
Sometimes an error can occur when checksums are being calculated by the cluster. As a result of this calculation error, the CLI console could display out of sync error messages even though the cluster is otherwise operating normally. You can also sometimes see checksum calculation errors in diagnose sys ha checksum command output when the checksums listed in the debugzone output don’t match the checksums in the checksum part of the output.
One solution to this problem could be to re-calculate the checksums. The re-calculated checksums should match and the out of sync error messages should stop appearing.
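To spot this kind of calculation error in a saved capture, the debugzone and checksum parts of a single unit's output can be compared against each other. The following Python sketch illustrates the idea, assuming the output format shown in the examples at the start of this document; the function name debugzone_mismatches is illustrative.

```python
def debugzone_mismatches(output):
    """Return the names whose debugzone and checksum values differ on one unit."""
    sections = {"debugzone": {}, "checksum": {}}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line in sections:
            # "debugzone" or "checksum" header starts a new section.
            current = sections[line]
        elif current is not None and ":" in line:
            name, _, value = line.partition(":")
            # Store the checksum as a list of byte strings.
            current[name.strip()] = value.split()
    names = sections["debugzone"].keys() | sections["checksum"].keys()
    return sorted(
        name for name in names
        if sections["debugzone"].get(name) != sections["checksum"].get(name)
    )
```

A non-empty result lists the scopes (global, a VDOM name, or all) that are candidates for recalculation.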
You can use the following command to re-calculate HA checksums:
diagnose sys ha checksum recalculate [<vdom-name> | global]
Entering the command without options recalculates all checksums. You can specify a VDOM name to recalculate only the checksums for that VDOM. You can also enter global to recalculate the global checksum.
Disabling automatic configuration synchronization
In some cases you may want to use the following command to disable automatic synchronization of the primary unit configuration to all cluster units.
config system ha
set sync-config disable
end
When this option is disabled the cluster no longer synchronizes configuration changes. If a device failure occurs, the new primary unit may not have the same configuration as the failed primary unit. As a result, the new primary unit may process sessions differently or may not function on the network in the same way.
In most cases you should not disable automatic configuration synchronization. However, if you have disabled this feature you can use the execute ha synchronize command to manually synchronize a subordinate unit’s configuration to that of the primary unit.
You must enter execute ha synchronize commands from the subordinate unit that you want to synchronize with the primary unit. Use the execute ha manage command to access a subordinate unit CLI.
For example, to access the first subordinate unit and force a synchronization at any time, even if automatic synchronization is disabled, enter:
execute ha manage 0
execute ha synchronize start
You can use the following command to stop a synchronization that is in progress.
execute ha synchronize stop
Incremental synchronization
When you log into the cluster GUI or CLI to make configuration changes, you are actually logging into the primary unit. All of your configuration changes are first made to the primary unit. Incremental synchronization then immediately synchronizes these changes to all of the subordinate units.
When you log into a subordinate unit CLI (for example, using execute ha manage), all of the configuration changes that you make to the subordinate unit are also immediately synchronized to all cluster units, including the primary unit, using the same process.
Incremental synchronization also synchronizes other dynamic configuration information such as the DHCP server address lease database, routing table updates, IPsec SAs, MAC address tables, and so on. See FortiGate HA compatibility with DHCP and PPPoE for more information about DHCP server address lease synchronization and Synchronizing kernel routing tables for information about routing table updates.
Whenever a change is made to a cluster unit configuration, incremental synchronization sends the same configuration change to all other cluster units over the HA heartbeat link. An HA synchronization process running on each cluster unit receives the configuration change and applies it to the cluster unit. The HA synchronization process makes the configuration change by entering a CLI command that appears to be entered by the administrator who made the configuration change in the first place.
Synchronization takes place silently, and no log messages are recorded about the synchronization activity. However, log messages can be recorded by the cluster units when the synchronization process enters CLI commands. You can see these log messages on the subordinate units if you enable event logging and set the minimum severity level to Information and then check the event log messages written by the cluster units when you make a configuration change.
You can also see these log messages on the primary unit if you make configuration changes from a subordinate unit.
Periodic synchronization
Incremental synchronization makes sure that as an administrator makes configuration changes, the configurations of all cluster units remain the same. However, a number of factors could cause one or more cluster units to go out of sync with the primary unit. For example, if you add a new unit to a functioning cluster, the configuration of this new unit will not match the configuration of the other cluster units. It's not practical to use incremental synchronization to change the configuration of the new unit.
Periodic synchronization is a mechanism that looks for synchronization problems and fixes them. Every minute the cluster compares the configuration file checksum of the primary unit with the configuration file checksums of each of the subordinate units. If all subordinate unit checksums are the same as the primary unit checksum, all cluster units are considered synchronized.
If one or more of the subordinate unit checksums is not the same as the primary unit checksum, the subordinate unit configuration is considered out of sync with the primary unit. The checksum of the out of sync subordinate unit is checked again every 15 seconds. This re-checking occurs in case the configurations are out of sync because an incremental configuration sequence has not completed. If the checksums do not match after 5 checks the subordinate unit that is out of sync retrieves the configuration from the primary unit. The subordinate unit then reloads its configuration and resumes operating as a subordinate unit with the same configuration as the primary unit.
The configuration of the subordinate unit is reset in this way because when a subordinate unit configuration gets out of sync with the primary unit configuration there is no efficient way to determine what the configuration differences are and to correct them. Resetting the subordinate unit configuration becomes the most efficient way to resynchronize the subordinate unit.
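The timing described above can be modeled as a small state machine. The Python sketch below is only an illustration of the described behavior, not FortiOS code. It assumes that the fifth consecutive mismatch triggers the full resynchronization and that checking returns to once per minute afterwards; both are interpretations of the text, and all names are hypothetical.

```python
from dataclasses import dataclass

CHECK_INTERVAL_S = 60    # normal comparison interval ("every minute")
RECHECK_INTERVAL_S = 15  # interval used while a mismatch persists
MAX_MISMATCHES = 5       # consecutive mismatches before a full resync (assumed)


@dataclass
class SubordinateState:
    mismatches: int = 0  # consecutive mismatched checks seen so far


def next_action(state, primary_checksum, subordinate_checksum):
    """Return (action, seconds_until_next_check) for one checksum comparison."""
    if primary_checksum == subordinate_checksum:
        state.mismatches = 0
        return "in_sync", CHECK_INTERVAL_S
    state.mismatches += 1
    if state.mismatches >= MAX_MISMATCHES:
        # The subordinate retrieves and reloads the primary unit configuration.
        state.mismatches = 0
        return "resync_from_primary", CHECK_INTERVAL_S
    # Possibly an incremental change still in flight: look again soon.
    return "recheck", RECHECK_INTERVAL_S
```

Feeding five consecutive mismatches into next_action yields four "recheck" results followed by "resync_from_primary". A matching checksum at any point resets the counter, which models an incremental change that completes between checks.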
Synchronization requires that all cluster units run the same FortiOS firmware build. If some cluster units are running different firmware builds, then unstable cluster operation may occur and the cluster units may not be able to synchronize correctly.
Re-installing the firmware build running on the primary unit forces the primary unit to upgrade all cluster units to the same firmware build.
Console messages when configuration synchronization succeeds
When a cluster first forms, or when a new unit is added to a cluster as a subordinate unit, the following messages appear on the CLI console to indicate that the unit joined the cluster and had its configuration synchronized with the primary unit.
slave's configuration is not in sync with master's, sequence:0
slave's configuration is not in sync with master's, sequence:1
slave's configuration is not in sync with master's, sequence:2
slave's configuration is not in sync with master's, sequence:3
slave's configuration is not in sync with master's, sequence:4
slave starts to sync with master
logout all admin users
slave succeeded to sync with master
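When reviewing a saved console log, it can help to summarize whether a sequence like this one ended in a successful synchronization. The Python sketch below matches only the message texts quoted in this document; other firmware versions may word them differently, and the function name summarize_sync is illustrative.

```python
import re

# Matches both "slave is not in sync with master, sequence:N" and
# "slave's configuration is not in sync with master's, sequence:N".
MISMATCH = re.compile(r"not in sync with master.*sequence:(\d+)")


def summarize_sync(console_lines):
    """Count mismatch reports and note whether a resync started and succeeded."""
    summary = {"mismatch_reports": 0, "sync_started": False, "sync_succeeded": False}
    for line in console_lines:
        if MISMATCH.search(line):
            summary["mismatch_reports"] += 1
            # A new mismatch report supersedes any earlier success.
            summary["sync_succeeded"] = False
        elif "starts to sync with master" in line:
            summary["sync_started"] = True
        elif "succeeded to sync with master" in line:
            summary["sync_succeeded"] = True
    return summary
```

For the eight lines shown above, the summary reports five mismatch messages, a started synchronization, and a successful result. A log that keeps repeating the mismatch messages without the final line points to the failure case described earlier.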