FortiGate HA-Cluster Troubleshooting using Checksums

Comparing checksums of cluster units

You can use the diagnose sys ha checksum show command to compare the configuration checksums of all cluster units. The output of this command shows checksums labeled global and all as well as checksums for each of the VDOMs including the root VDOM. The get system ha-nonsync-csum command can be used to display similar information; however, this command is intended to be used by FortiManager.

The primary unit and subordinate unit checksums should be the same. If they are not, you can use the execute ha synchronize start command to force a synchronization.
The following command output is for the primary unit of a cluster that does not have multiple VDOMs enabled:

diagnose sys ha checksum show
is_manage_master()=1, is_root_master()=1
debugzone
global: a0 7f a7 ff ac 00 d5 b6 82 37 cc 13 3e 0b 9b 77
root: 43 72 47 68 7b da 81 17 c8 f5 10 dd fd 6b e9 57
all: c5 90 ed 22 24 3e 96 06 44 35 b6 63 7c 84 88 d5

checksum
global: a0 7f a7 ff ac 00 d5 b6 82 37 cc 13 3e 0b 9b 77
root: 43 72 47 68 7b da 81 17 c8 f5 10 dd fd 6b e9 57
all: c5 90 ed 22 24 3e 96 06 44 35 b6 63 7c 84 88 d5 


The following command output is for a subordinate unit of the same cluster:

diagnose sys ha checksum show
is_manage_master()=0, is_root_master()=0
debugzone
global: a0 7f a7 ff ac 00 d5 b6 82 37 cc 13 3e 0b 9b 77
root: 43 72 47 68 7b da 81 17 c8 f5 10 dd fd 6b e9 57
all: c5 90 ed 22 24 3e 96 06 44 35 b6 63 7c 84 88 d5

checksum
global: a0 7f a7 ff ac 00 d5 b6 82 37 cc 13 3e 0b 9b 77
root: 43 72 47 68 7b da 81 17 c8 f5 10 dd fd 6b e9 57
all: c5 90 ed 22 24 3e 96 06 44 35 b6 63 7c 84 88 d5 


The following example shows the output of this command for a cluster with multiple VDOMs enabled. Two VDOMs named test and Eng_vdm have been added.
From the primary unit:

config global
diagnose sys ha checksum show
is_manage_master()=1, is_root_master()=1
debugzone
global: 65 75 88 97 2d 58 1b bf 38 d3 3d 52 5b 0e 30 a9
test: a5 16 34 8c 7a 46 d6 a4 1e 1f c8 64 ec 1b 53 fe
root: 3c 12 45 98 69 f2 d8 08 24 cf 02 ea 71 57 a7 01
Eng_vdm: 64 51 7c 58 97 79 b1 b3 b3 ed 5c ec cd 07 74 09
all: 30 68 77 82 a1 5d 13 99 d1 42 a3 2f 9f b9 15 53

checksum
global: 65 75 88 97 2d 58 1b bf 38 d3 3d 52 5b 0e 30 a9
test: a5 16 34 8c 7a 46 d6 a4 1e 1f c8 64 ec 1b 53 fe
root: 3c 12 45 98 69 f2 d8 08 24 cf 02 ea 71 57 a7 01
Eng_vdm: 64 51 7c 58 97 79 b1 b3 b3 ed 5c ec cd 07 74 09
all: 30 68 77 82 a1 5d 13 99 d1 42 a3 2f 9f b9 15 53 

From the subordinate unit:

config global
diagnose sys ha checksum show
is_manage_master()=0, is_root_master()=0
debugzone
global: 65 75 88 97 2d 58 1b bf 38 d3 3d 52 5b 0e 30 a9
test: a5 16 34 8c 7a 46 d6 a4 1e 1f c8 64 ec 1b 53 fe
root: 3c 12 45 98 69 f2 d8 08 24 cf 02 ea 71 57 a7 01
Eng_vdm: 64 51 7c 58 97 79 b1 b3 b3 ed 5c ec cd 07 74 09
all: 30 68 77 82 a1 5d 13 99 d1 42 a3 2f 9f b9 15 53

checksum
global: 65 75 88 97 2d 58 1b bf 38 d3 3d 52 5b 0e 30 a9
test: a5 16 34 8c 7a 46 d6 a4 1e 1f c8 64 ec 1b 53 fe
root: 3c 12 45 98 69 f2 d8 08 24 cf 02 ea 71 57 a7 01
Eng_vdm: 64 51 7c 58 97 79 b1 b3 b3 ed 5c ec cd 07 74 09
all: 30 68 77 82 a1 5d 13 99 d1 42 a3 2f 9f b9 15 53 
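
Comparing the outputs by eye is error-prone once several VDOMs are involved. The following Python sketch shows one way to compare two saved outputs. It assumes the output has the layout shown in the examples above (a debugzone section and a checksum section, each with name: hex-bytes lines). The function names parse_checksum_output and mismatched_scopes are illustrative and are not part of any Fortinet tool.

```python
def parse_checksum_output(text):
    """Split saved 'diagnose sys ha checksum show' output into its sections.

    Returns a dict such as {'debugzone': {name: hex}, 'checksum': {name: hex}}.
    Assumes the layout shown in the examples above.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line in ("debugzone", "checksum"):
            current = sections.setdefault(line, {})
        elif current is not None and ":" in line:
            name, _, value = line.partition(":")
            # Normalize spacing and case so only real differences are reported.
            current[name.strip()] = " ".join(value.split()).lower()
    return sections


def mismatched_scopes(primary_text, subordinate_text, section="checksum"):
    """Return the sorted names (global, root, VDOM names, all) whose checksums differ."""
    primary = parse_checksum_output(primary_text).get(section, {})
    subordinate = parse_checksum_output(subordinate_text).get(section, {})
    return sorted(
        name
        for name in primary.keys() | subordinate.keys()
        if primary.get(name) != subordinate.get(name)
    )
```

An empty list means the two units report the same checksums for every scope in that section.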

How to diagnose HA out of sync messages

This section describes how to use the diagnose sys ha checksum show and diagnose debug commands to diagnose the cause of HA out of sync messages.
If HA synchronization is not successful, use the following procedures on each cluster unit to find the cause.
To determine why HA synchronization does not occur
  1. Connect to each cluster unit CLI by connecting to the console port.
  2. Enter the following commands to enable debugging and display HA out of sync messages.
diagnose debug enable
diagnose debug console timestamp enable
diagnose debug application hatalk -1
diagnose debug application hasync -1
  3. Collect the console output and compare the out of sync messages with the information in the table HA out of sync object messages and the configuration objects that they reference.
  4. Enter the following commands to turn off debugging.
diagnose debug disable
diagnose debug reset
To determine what part of the configuration is causing the problem
If the previous procedure displays messages that include sync object 0x03 (HA_SYNC_SETTING_CONFIGURATION = 0x03), there is a synchronization problem with the configuration. Use the following steps to determine the part of the configuration that is causing the problem.
If your cluster consists of two cluster units, use this procedure to capture the configuration checksums for each unit. If your cluster consists of more than two cluster units, repeat this procedure for all cluster units that returned messages that include the 0x03 sync object.
  1. Connect to each cluster unit CLI by connecting to the console port.
  2. Enter the following command to turn on terminal capture.
diagnose debug enable

  3. Enter the following command to stop HA synchronization.
execute ha synchronize stop

  4. Enter the following command to display configuration checksums.
diagnose sys ha checksum show global

  5. Copy the output to a text file.
  6. Repeat for all affected units.
  7. Compare the text file from the primary unit with the text file from each cluster unit to find the checksums that do not match.
You can use a diff function to compare text files.
  8. Repeat for the root VDOM:
diagnose sys ha checksum show root

  9. Repeat for all VDOMs (if multiple VDOM configuration is enabled):
diagnose sys ha checksum show <vdom-name>

  10. You can also use the grep option to display checksums for only part of the configuration.
For example, to display system-related configuration checksums in the root VDOM or log-related checksums in the global configuration:
diagnose sys ha checksum show root | grep system
diagnose sys ha checksum show global | grep log
Generally, the first non-matching checksum is the cause of the synchronization problem.
  11. Attempt to remove or change the part of the configuration that is causing the problem. You can do this by making configuration changes from the primary unit or subordinate unit CLI.
  12. Enter the following commands to restart HA synchronization and stop debugging:
execute ha synchronize start
diagnose debug disable
diagnose debug reset
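
The comparison step in this procedure can be scripted instead of done with a manual diff. The Python sketch below reports the first checksum entry that differs between two captured files. It assumes each checksum line in the capture has the form name: hex-bytes, and it ignores any other lines. The function names and the file names in the usage note are hypothetical.

```python
import re

# A checksum value is a run of two-digit hex bytes separated by single spaces.
HEX_BYTES = re.compile(r"(?:[0-9a-f]{2} )*[0-9a-f]{2}")


def checksum_entries(lines):
    """Collect 'name: hex bytes' lines in order, skipping everything else."""
    entries = {}
    for raw in lines:
        name, sep, value = raw.strip().partition(":")
        value = " ".join(value.split()).lower()
        if sep and HEX_BYTES.fullmatch(value):
            entries[name.strip()] = value
    return entries


def first_mismatch(primary_lines, subordinate_lines):
    """Return the first entry (in primary order) whose checksum differs, or None."""
    subordinate = checksum_entries(subordinate_lines)
    for name, value in checksum_entries(primary_lines).items():
        if subordinate.get(name) != value:
            return name
    return None
```

For example, with captures saved as primary_global.txt and subordinate_global.txt, calling first_mismatch(open("primary_global.txt"), open("subordinate_global.txt")) returns the name of the first entry to investigate, or None if every entry matches.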

Console messages when configuration synchronization fails

If you connect to the console of a subordinate unit that is out of synchronization with the primary unit, messages similar to the following are displayed.

slave is not in sync with master, sequence:0. (type 0x3)
slave is not in sync with master, sequence:1. (type 0x3)
slave is not in sync with master, sequence:2. (type 0x3)
slave is not in sync with master, sequence:3. (type 0x3)
slave is not in sync with master, sequence:4. (type 0x3)
global compared not matched

If synchronization problems occur, the console message sequence may repeat continuously. All of the messages include a type value (0x3 in the example), which can help Fortinet Support diagnose the synchronization problem.
HA out of sync object messages and the configuration objects that they reference
Out of Sync Message                     Configuration Object
HA_SYNC_SETTING_CONFIGURATION = 0x03    /data/config
HA_SYNC_SETTING_AV = 0x10
HA_SYNC_SETTING_VIR_DB = 0x11           /etc/vir
HA_SYNC_SETTING_SHARED_LIB = 0x12       /data/lib/libav.so
HA_SYNC_SETTING_SCAN_UNIT = 0x13        /bin/scanunitd
HA_SYNC_SETTING_IMAP_PRXY = 0x14        /bin/imapd
HA_SYNC_SETTING_SMTP_PRXY = 0x15        /bin/smtp
HA_SYNC_SETTING_POP3_PRXY = 0x16        /bin/pop3
HA_SYNC_SETTING_HTTP_PRXY = 0x17        /bin/thttp
HA_SYNC_SETTING_FTP_PRXY = 0x18         /bin/ftpd
HA_SYNC_SETTING_FCNI = 0x19             /etc/fcni.dat
HA_SYNC_SETTING_FDNI = 0x1a             /etc/fdnservers.dat
HA_SYNC_SETTING_FSCI = 0x1b             /etc/sci.dat
HA_SYNC_SETTING_FSAE = 0x1c             /etc/fsae_adgrp.cache
HA_SYNC_SETTING_IDS = 0x20              /etc/ids.rules
HA_SYNC_SETTING_IDSUSER_RULES = 0x21    /etc/idsuser.rules
HA_SYNC_SETTING_IDSCUSTOM = 0x22
HA_SYNC_SETTING_IDS_MONITOR = 0x23      /bin/ipsmonitor
HA_SYNC_SETTING_IDS_SENSOR = 0x24       /bin/ipsengine
HA_SYNC_SETTING_NIDS_LIB = 0x25         /data/lib/libips.so
HA_SYNC_SETTING_WEBLISTS = 0x30
HA_SYNC_SETTING_CONTENTFILTER = 0x31    /data/cmdb/webfilter.bword
HA_SYNC_SETTING_URLFILTER = 0x32        /data/cmdb/webfilter.urlfilter
HA_SYNC_SETTING_FTGD_OVRD = 0x33        /data/cmdb/webfilter.fgtd-ovrd
HA_SYNC_SETTING_FTGD_LRATING = 0x34     /data/cmdb/webfilter.fgtd-ovrd
HA_SYNC_SETTING_EMAILLISTS = 0x40
HA_SYNC_SETTING_EMAILCONTENT = 0x41     /data/cmdb/spamfilter.bword
HA_SYNC_SETTING_EMAILBWLIST = 0x42      /data/cmdb/spamfilter.emailbwl
HA_SYNC_SETTING_IPBWL = 0x43            /data/cmdb/spamfilter.ipbwl
HA_SYNC_SETTING_MHEADER = 0x44          /data/cmdb/spamfilter.mheader
HA_SYNC_SETTING_RBL = 0x45              /data/cmdb/spamfilter.rbl
HA_SYNC_SETTING_CERT_CONF = 0x50        /etc/cert/cert.conf
HA_SYNC_SETTING_CERT_CA = 0x51          /etc/cert/ca
HA_SYNC_SETTING_CERT_LOCAL = 0x52       /etc/cert/local
HA_SYNC_SETTING_CERT_CRL = 0x53         /etc/cert/crl
HA_SYNC_SETTING_DB_VER = 0x55
HA_GET_DETAIL_CSUM = 0x71
HA_SYNC_CC_SIG = 0x75                   /etc/cc_sig.dat
HA_SYNC_CC_OP = 0x76                    /etc/cc_op
HA_SYNC_CC_MAIN = 0x77                  /etc/cc_main
HA_SYNC_FTGD_CAT_LIST = 0x7a            /migadmin/webfilter/ublock/ftgd/ data/
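
When many console messages have been collected, looking up each type value by hand is tedious. The Python sketch below extracts the type value from a console message and looks it up in an excerpt of the table above. It assumes the message format shown in the console example, "(type 0x3)". The names SYNC_OBJECTS and describe_sync_type are illustrative, and the dictionary holds only a few rows of the table; extend it as needed.

```python
import re

# Excerpt of the table above: type code -> (message name, configuration object).
# None means the table lists no configuration object for that code.
SYNC_OBJECTS = {
    0x03: ("HA_SYNC_SETTING_CONFIGURATION", "/data/config"),
    0x11: ("HA_SYNC_SETTING_VIR_DB", "/etc/vir"),
    0x20: ("HA_SYNC_SETTING_IDS", "/etc/ids.rules"),
    0x30: ("HA_SYNC_SETTING_WEBLISTS", None),
    0x50: ("HA_SYNC_SETTING_CERT_CONF", "/etc/cert/cert.conf"),
    0x55: ("HA_SYNC_SETTING_DB_VER", None),
}

TYPE_PATTERN = re.compile(r"\(type (0x[0-9a-fA-F]+)\)")


def describe_sync_type(message):
    """Return (name, configuration object) for the type value in a console message.

    Returns None if the message carries no type value, and ("unknown", None)
    if the type value is not in the excerpt above.
    """
    match = TYPE_PATTERN.search(message)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return SYNC_OBJECTS.get(code, ("unknown", None))
```

Note that the console prints 0x3 without a leading zero, so the lookup compares numeric values rather than strings.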

Synchronizing the configuration

The FGCP uses a combination of incremental and periodic synchronization to make sure that the configuration of all cluster units is synchronized to that of the primary unit.
The following settings are not synchronized between cluster units:
  • HA override.
  • HA device priority.
  • The virtual cluster priority.
  • The FortiGate host name.
  • The HA priority setting for a ping server (or dead gateway detection) configuration.
  • The system interface settings of the HA reserved management interface.
  • The HA default route for the reserved management interface, set using the ha-mgmt-interface-gateway option of the config system ha command.
The primary unit synchronizes all other configuration settings, including the other HA configuration settings.
All synchronization activity takes place over the HA heartbeat link using TCP/703 and UDP/703 packets.

Recalculating the checksums to resolve out of sync messages

Sometimes an error can occur when checksums are being calculated by the cluster. As a result of this calculation error, the CLI console could display out of sync error messages even though the cluster is otherwise operating normally. You can also sometimes see checksum calculation errors in diagnose sys ha checksum show command output when the checksums listed in the debugzone part of the output don't match the checksums in the checksum part of the output.
One solution to this problem could be to re-calculate the checksums. The re-calculated checksums should match and the out of sync error messages should stop appearing.
You can use the following command to re-calculate HA checksums:
diagnose sys ha checksum recalculate [<vdom-name> | global]
Just entering the command without options recalculates all checksums. You can specify a VDOM name to just recalculate the checksums for that VDOM. You can also enter global to recalculate the global checksum.
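
To spot the debugzone and checksum mismatch described above, the two parts of a single saved output can be compared with a short script. The Python sketch below does that and builds the matching recalculate commands from the syntax given above. It assumes the output layout shown earlier in this article. The function names are illustrative, and the sketch only prints command strings; it does not connect to a FortiGate unit.

```python
def internal_mismatches(text):
    """Return names whose debugzone checksum differs from the checksum section.

    Expects the saved output of 'diagnose sys ha checksum show' from one unit.
    """
    sections = {"debugzone": {}, "checksum": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line in sections:
            current = sections[line]
        elif current is not None and ":" in line:
            name, _, value = line.partition(":")
            current[name.strip()] = " ".join(value.split()).lower()
    debugzone, checksum = sections["debugzone"], sections["checksum"]
    return [name for name in debugzone if debugzone[name] != checksum.get(name)]


def recalculate_commands(scopes):
    """Build one recalculate command per scope (global or a VDOM name).

    The 'all' entry is skipped because it is not a VDOM name or 'global'.
    """
    return [
        "diagnose sys ha checksum recalculate " + scope
        for scope in scopes
        if scope != "all"
    ]
```

If the list of mismatches is long, entering the recalculate command without options, as described above, recalculates all checksums in one step.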


Disabling automatic configuration synchronization

In some cases you may want to use the following command to disable automatic synchronization of the primary unit configuration to all cluster units.

config system ha
set sync-config disable
end

When this option is disabled the cluster no longer synchronizes configuration changes. If a device failure occurs, the new primary unit may not have the same configuration as the failed primary unit. As a result, the new primary unit may process sessions differently or may not function on the network in the same way.
In most cases you should not disable automatic configuration synchronization. However, if you have disabled this feature you can use the execute ha synchronize command to manually synchronize a subordinate unit’s configuration to that of the primary unit.
You must enter execute ha synchronize commands from the subordinate unit that you want to synchronize with the primary unit. Use the execute ha manage command to access a subordinate unit CLI.
For example, to access the first subordinate unit and force a synchronization at any time, even if automatic synchronization is disabled, enter:
execute ha manage 0
execute ha synchronize start
You can use the following command to stop a synchronization that is in progress.
execute ha synchronize stop

Incremental synchronization

When you log into the cluster GUI or CLI to make configuration changes, you are actually logging into the primary unit. All of your configuration changes are first made to the primary unit. Incremental synchronization then immediately synchronizes these changes to all of the subordinate units.

When you log into a subordinate unit CLI (for example using execute ha manage) all of the configuration changes that you make to the subordinate unit are also immediately synchronized to all cluster units, including the primary unit, using the same process.
Incremental synchronization also synchronizes other dynamic configuration information such as the DHCP server address lease database, routing table updates, IPsec SAs, MAC address tables, and so on. See FortiGate HA compatibility with DHCP and PPPoE for more information about DHCP server address lease synchronization and Synchronizing kernel routing tables for information about routing table updates.

Whenever a change is made to a cluster unit configuration, incremental synchronization sends the same configuration change to all other cluster units over the HA heartbeat link. An HA synchronization process running on each cluster unit receives the configuration change and applies it to the cluster unit. The HA synchronization process makes the configuration change by entering a CLI command that appears to be entered by the administrator who made the configuration change in the first place.

Synchronization takes place silently, and no log messages are recorded about the synchronization activity. However, log messages can be recorded by the cluster units when the synchronization process enters CLI commands. You can see these log messages on the subordinate units if you enable event logging and set the minimum severity level to Information and then check the event log messages written by the cluster units when you make a configuration change.
You can also see these log messages on the primary unit if you make configuration changes from a subordinate unit.

Periodic synchronization

Incremental synchronization makes sure that as an administrator makes configuration changes, the configurations of all cluster units remain the same. However, a number of factors could cause one or more cluster units to go out of sync with the primary unit. For example, if you add a new unit to a functioning cluster, the configuration of this new unit will not match the configuration of the other cluster units. It's not practical to use incremental synchronization to change the configuration of the new unit.

Periodic synchronization is a mechanism that looks for synchronization problems and fixes them. Every minute the cluster compares the configuration file checksum of the primary unit with the configuration file checksums of each of the subordinate units. If all subordinate unit checksums are the same as the primary unit checksum, all cluster units are considered synchronized.
If one or more of the subordinate unit checksums is not the same as the primary unit checksum, the subordinate unit configuration is considered out of sync with the primary unit. The checksum of the out of sync subordinate unit is checked again every 15 seconds. This re-checking occurs in case the configurations are out of sync because an incremental configuration sequence has not completed. If the checksums do not match after 5 checks the subordinate unit that is out of sync retrieves the configuration from the primary unit. The subordinate unit then reloads its configuration and resumes operating as a subordinate unit with the same configuration as the primary unit.
The configuration of the subordinate unit is reset in this way because when a subordinate unit configuration gets out of sync with the primary unit configuration there is no efficient way to determine what the configuration differences are and to correct them. Resetting the subordinate unit configuration becomes the most efficient way to resynchronize the subordinate unit.
Synchronization requires that all cluster units run the same FortiOS firmware build. If some cluster units are running different firmware builds, then unstable cluster operation may occur and the cluster units may not be able to synchronize correctly.
Note: Re-installing the firmware build running on the primary unit forces the primary unit to upgrade all cluster units to the same firmware build.
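
As a rough illustration of the re-check timing described in this section, the following Python sketch models how long a subordinate unit stays out of sync before it retrieves the configuration from the primary unit. It is a simplified model, not FortiOS code. It assumes the five checks are the 15-second re-checks that follow the first detected mismatch, which the text does not state explicitly. The constant and function names are hypothetical.

```python
RECHECK_INTERVAL_S = 15  # re-check interval for an out of sync unit (from the text)
MAX_CHECKS = 5           # failed checks before the full resynchronization (from the text)


def seconds_until_resync(check_results):
    """Model the re-check loop for one out of sync subordinate unit.

    check_results is an iterable of booleans, one per re-check, where True
    means the checksums matched on that re-check. Returns the elapsed seconds
    at which the subordinate unit would retrieve the primary configuration,
    or None if the checksums matched (or the input ended) before that point.
    """
    for attempt, in_sync in enumerate(check_results, start=1):
        if in_sync:
            return None  # incremental synchronization caught up in time
        if attempt == MAX_CHECKS:
            return attempt * RECHECK_INTERVAL_S
    return None
```

Under this assumption, a unit that never catches up is resynchronized about 75 seconds after the mismatch is first detected, which is consistent with the five sequence messages (sequence:0 to sequence:4) shown in the console output examples.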

Console messages when configuration synchronization succeeds

When a cluster first forms, or when a new unit is added to a cluster as a subordinate unit, the following messages appear on the CLI console to indicate that the unit joined the cluster and had its configuration synchronized with the primary unit.
slave's configuration is not in sync with master's, sequence:0
slave's configuration is not in sync with master's, sequence:1
slave's configuration is not in sync with master's, sequence:2
slave's configuration is not in sync with master's, sequence:3
slave's configuration is not in sync with master's, sequence:4
slave starts to sync with master
logout all admin users
slave succeeded to sync with master
