Geographically redundant datacenters - their performance issues and design solutions

Performance problem - Distance and Latency

More and more companies face the challenging requirement to provide highly available services that survive regional disasters. In other words: their datacenters must be located in different geographical regions. However, high availability solutions, storage systems and systems that perform many sequential or write operations often require very low latency to run stably, especially under peak or high load.

I recently saw an application that had performance issues with some daily jobs as soon as the application server was running in a different datacenter than the database servers. The two datacenters were about 5 kilometers apart and connected by dedicated 2x 40 Gbit/s lines, with about 0.2 ms latency between the application server and the database servers. As soon as the application server was moved to the same datacenter as the database, these jobs took only about 1 hour instead of 8 hours.
👉 This 0.2 ms of latency added up and slowed down the application drastically
👉 Latency is critical when it comes to performance (see also the bandwidth-delay product and "long fat networks" (LFNs) in RFC 1072)
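
A rough back-of-the-envelope check shows how such a small latency can add up. The sketch below assumes the jobs are purely latency-bound (every database call waits for the previous one) and that the whole 7-hour difference comes from the extra 0.2 ms per round trip. Both are assumptions, not measurements, and the function name is made up for illustration.

```python
def implied_round_trips(extra_seconds: float, extra_latency_seconds: float) -> float:
    """Number of sequential round trips needed to accumulate the extra runtime."""
    return extra_seconds / extra_latency_seconds

# Figures from the example above: 8 h remote vs. 1 h local, 0.2 ms latency.
extra_runtime_s = (8 - 1) * 3600   # 7 hours of additional runtime
extra_latency_s = 0.2e-3           # assumed to be the full extra cost per round trip

trips = implied_round_trips(extra_runtime_s, extra_latency_s)
print(f"{trips:,.0f} sequential round trips")
```

Under these assumptions the jobs would have to perform roughly 126 million sequential round trips, which is a plausible order of magnitude for a batch job that issues many small database calls one after another.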

But: as soon as your datacenters have to be separated by hundreds of kilometers in order to be geographically redundant, your latency will grow, because you can't change the speed of light.

👉 If the datacenters in the example were separated not by 5 kilometers but by 200 kilometers, latency would grow from 0.2 ms to 3-5 ms (or more), which would be the death of the application, because those jobs would take far too long.
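
The physical lower bound for the round-trip time can be estimated from the distance alone. The following is a minimal sketch, assuming that light in glass fibre travels at roughly two thirds of its speed in vacuum and that the cable runs in a straight line. The constant and function names are chosen for illustration.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # in vacuum
FIBRE_FACTOR = 2 / 3                # assumption: propagation speed in fibre is about 2/3 c

def min_rtt_ms(distance_km: float, factor: float = FIBRE_FACTOR) -> float:
    """Lower bound for the round-trip time over a straight fibre of the given length."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * factor)
    return 2 * one_way_s * 1000

for km in (5, 200, 400):
    print(f"{km:>4} km: at least {min_rtt_ms(km):.2f} ms round trip")
```

This gives about 0.05 ms for 5 km and about 2 ms for 200 km. Real cable routes are longer than the straight-line distance, and every transceiver, switch and protocol layer adds delay, which is why measured values are higher than this bound.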

The consequence: you can't (and shouldn't) stretch applications, high availability clusters, SANs, etc. over long pipes, even if they are fat. Even if you have many directly connected 100 Gbit/s dark fibre links between your datacenters, the higher the latency, the worse your performance. (Of course bandwidth is important too, as are other factors.)
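
Why a fat pipe does not help a latency-bound stream can be illustrated with the bandwidth-delay product. The sketch below is a simplified model: it assumes a single stream that may have only one fixed window of data in flight per round trip, and the 64 KiB window (classic TCP without window scaling) is an assumption chosen for illustration.

```python
def bdp_bytes(bandwidth_bit_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return bandwidth_bit_s * rtt_s / 8

def max_stream_throughput_bit_s(window_bytes: float, rtt_s: float) -> float:
    """Upper bound for one window-limited stream: one window per round trip."""
    return window_bytes * 8 / rtt_s

LINK_BIT_S = 100e9          # 100 Gbit/s link, as in the example above
WINDOW_BYTES = 64 * 1024    # assumption: 64 KiB window

for rtt_ms in (0.2, 3.0, 5.0):
    rtt_s = rtt_ms / 1000
    bdp_mb = bdp_bytes(LINK_BIT_S, rtt_s) / 1e6
    mbit_s = max_stream_throughput_bit_s(WINDOW_BYTES, rtt_s) / 1e6
    print(f"RTT {rtt_ms} ms: BDP {bdp_mb:.1f} MB, one stream at most {mbit_s:.0f} Mbit/s")
```

In this model the achievable throughput of a single stream falls in proportion to the round-trip time, no matter how much bandwidth the link offers.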

How do others do it?

Microsoft has published some documents that discuss possible solutions:
https://docs.microsoft.com/en-us/sharepoint/administration/plan-for-disaster-recovery
  • Cold standby. A secondary data center that can provide availability within hours or days.
  • Warm standby. A secondary data center that can provide availability within minutes or hours.
  • Hot standby. A secondary data center that can provide availability within seconds or minutes.
[...]
Important: Available network bandwidth and latency are major considerations when you are using a failover approach for disaster recovery. We recommend that you consult with your SAN vendor to determine whether you can use SAN replication for SQL databases or another supported mechanism to provide the hot standby level of availability across data centers.

There is also a video of the great Mark Russinovich (Microsoft Azure CTO) https://www.youtube.com/watch?v=X-0V6bYfTpA


In this video, Microsoft talks about its Azure datacenter requirements and mentions that, within a region, the network latency perimeter must stay under 2 ms. Mark Russinovich also mentions that datacenters are therefore within a 100-kilometer range of each other.

There are other useful documents, for example on database redundancy. The following one covers business continuity planning, recovery point objective (RPO) and estimated recovery time (ERT), how to monitor for the failure of a site, and how ERT is affected not only by the cluster switch, but also by failure detection time + DNS TTL.
https://docs.microsoft.com/en-us/azure/sql-database/sql-database-designing-cloud-solutions-for-disaster-recovery
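
The point about ERT can be made concrete with a small calculation. This is a simplified sketch that treats the worst-case ERT as the plain sum of failure detection time, cluster switch time and DNS TTL. The class name and all numbers are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass

@dataclass
class FailoverTimings:
    detection_s: float   # time until monitoring declares the site failed
    switch_s: float      # time for the cluster/database switch itself
    dns_ttl_s: float     # worst case until clients resolve the new address

    def estimated_recovery_time_s(self) -> float:
        """Worst-case ERT as the sum of the three phases (a simplification)."""
        return self.detection_s + self.switch_s + self.dns_ttl_s

# Hypothetical values for illustration only.
timings = FailoverTimings(detection_s=90, switch_s=30, dns_ttl_s=300)
print(f"ERT: about {timings.estimated_recovery_time_s() / 60:.0f} minutes")
```

With these example values the cluster switch accounts for only 30 of 420 seconds, so shortening the detection time or the DNS TTL would have a larger effect than a faster switch.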


There is also a great article on the Percona blog that talks about this issue: https://www.percona.com/blog/2018/11/15/how-not-to-do-mysql-high-availability-geographic-node-distribution-with-galera-based-replication-misuse/
The following excerpts are quoted from it:
We had two datacenters.
  • The connection between the two was with fiber
  • Distance Km ~400, but now we MUST consider the distance to go and come back. This because in case of real communication, we have not only the send, but also the receive packages.
  • Theoretical time at light-speed =2.66ms (2 ways)
  • Ping = 3.10ms (signal traveling at ~80% of the light speed) as if the signal had traveled ~930Km (full roundtrip 800 Km)
  • TCP/IP best at 48K = 4.27ms (~62% light speed) as if the signal had traveled ~1,281km
  • TCP/IP best at 512K =37.25ms (~2.6% light speed) as if the signal had traveled ~11,175km
 Given the above, we have from ~20%-~40% to ~97% loss from the theoretical transmission rate. Keep in mind that when moving from a simple signal to a more heavy and concurrent transmission, we also have to deal with the bandwidth limitation. This adds additional cost. All in only 400Km distance.
This is not all. Within the 400km we were also dealing with data congestions, and in some cases the tests failed to provide the level of accuracy we required due to transmission failures and too many packages retry.
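
The "as if the signal had traveled" figures in the quoted measurements can be approximately re-derived by multiplying each measured time by the speed of light in vacuum. This is a sketch of that arithmetic only. Whether the article used exactly this basis is an assumption, and the function name is made up for illustration.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in vacuum

def equivalent_distance_km(measured_ms: float) -> float:
    """Distance light would cover in vacuum during the measured time."""
    return measured_ms / 1000 * SPEED_OF_LIGHT_KM_S

# Measured times quoted above (physical round trip: about 800 km).
measurements_ms = {"ping": 3.10, "TCP/IP 48K": 4.27, "TCP/IP 512K": 37.25}

for label, ms in measurements_ms.items():
    print(f"{label:<12} {ms:>6.2f} ms  ~{equivalent_distance_km(ms):,.0f} km")
```

The results land close to the quoted ~930 km, ~1,281 km and ~11,175 km.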

[...]
What Is the Right Thing To Do? 

The right solution is easier than the wrong one, and there are already tools in place to make it work efficiently. Say you need to define your HA solution between the East and West Coast, or between Paris and Frankfurt. First of all, identify the real capacity of your network in each DC. Then build a tightly coupled database cluster in location A and another tightly coupled database cluster in the other location B. Then link them using ASYNCHRONOUS replication.
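
Why asynchronous replication between the sites helps can be sketched with a very simple model: if every commit has to wait for the remote site, the round-trip time is added to each commit, whereas with asynchronous replication the remote site catches up later. The model below assumes strictly sequential commits on a single connection and one round trip per synchronous commit. Real replication protocols such as Galera are more complex, and the local commit cost is a hypothetical value.

```python
def commits_per_second(local_commit_ms: float, replication_rtt_ms: float = 0.0) -> float:
    """Upper bound for strictly sequential commits on a single connection."""
    return 1000 / (local_commit_ms + replication_rtt_ms)

LOCAL_COMMIT_MS = 0.5   # hypothetical local commit cost
WAN_RTT_MS = 3.10       # ping time from the measurements quoted above

sync_rate = commits_per_second(LOCAL_COMMIT_MS, WAN_RTT_MS)   # waits for the remote site
async_rate = commits_per_second(LOCAL_COMMIT_MS)              # remote site catches up later

print(f"synchronous across sites: {sync_rate:.0f} commits/s")
print(f"asynchronous:             {async_rate:.0f} commits/s")
```

The trade-off is that with asynchronous replication the most recent transactions may not have reached the other site when a failure occurs, so the RPO is greater than zero.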



Conclusion

👉 Unfortunately, there is no easy solution for this. Each application, system and solution has its own requirements and possible solutions. But: don't underestimate the distance between your datacenters and the latency that comes with it, even if a few milliseconds don't sound like much.


List FortiGate Certificates via CLI - CA certificates and local Certificates


You can list all certificates either in the GUI of the FortiGate or on the CLI. On the CLI, the following commands are available:

Using the "get" command

config vdom
edit root   #<--- your management vdom/your vdom of choice
get vpn certificate ca

FGT50E00000000 (root) #
FGT50E00000000 (root) # get vpn certificate ca
== [ Fortinet_Wifi_CA ]
name: Fortinet_Wifi_CA
== [ Fortinet_CA ]
name: Fortinet_CA
== [ ACCVRAIZ1 ]
name: ACCVRAIZ1
== [ AC_RAIZ_FNMT-RCM ]
name: AC_RAIZ_FNMT-RCM
== [ Actalis_Authentication_Root_CA ]
name: Actalis_Authentication_Root_CA

[...]
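
If the output of the "get" command has been saved as text (for example from a logged SSH session, which is an assumption about how it is collected), the certificate names can be extracted with a few lines of Python. This is a minimal sketch that relies only on the "== [ NAME ]" header lines shown above. The function name is made up for illustration.

```python
import re

def parse_certificate_names(cli_output: str) -> list[str]:
    """Extract the names from '== [ NAME ]' header lines."""
    return re.findall(r"^== \[ (.+?) \][ \t]*$", cli_output, flags=re.MULTILINE)

# Excerpt of the output shown above.
sample = """\
== [ Fortinet_Wifi_CA ]
name: Fortinet_Wifi_CA
== [ Fortinet_CA ]
name: Fortinet_CA
== [ ACCVRAIZ1 ]
name: ACCVRAIZ1
"""

for name in parse_certificate_names(sample):
    print(name)
```
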


Using the "show" command

The show command might not be very helpful, because it does not necessarily show all certificates:

FGT50E00000000 (root) #
FGT50E00000000 (root) # show vpn certificate ca
config vpn certificate ca
end

FGT50E00000000 (root) # show full-configuration vpn certificate ca
config vpn certificate ca
end


FGT50E00000000 (root) # show full-configuration | grep -f 'vpn certificate ca'
config vpn certificate ca <---
end

FGT50E00000000 (root) #

Using the "fnsysctl" command

Using the fnsysctl command might be helpful:

FGT50E00000000  #
FGT50E00000000 # fnsysctl ls -la /etc/cert/local/
drwxr-xr-x    2 0    0   Wed Dec 25 21:43:14 2019        0 .
drwxr-xr-x    6 0    0   Wed Sep 18 20:39:27 2019        0 ..
-rw-------    1 0    0   Wed Sep 18 20:35:46 2019     2250 root_2020jan_sub.domain.tld.cer
-rw-------    1 0    0   Wed Sep 18 20:35:46 2019     1704 KEY-FILE
-rw-------    1 0    0   Wed Sep 18 20:35:46 2019     1407 root_Fortinet_CA_SSL.cer
-rw-------    1 0    0   Wed Sep 18 20:35:47 2019     1704 KEY-FILE
-rw-------    1 0    0   Wed Sep 18 20:35:47 2019     1419 root_Fortinet_CA_Untrusted.cer
-rw-------    1 0    0   Wed Sep 18 20:35:47 2019     1704 KEY-FILE
-rw-------    1 0    0   Wed Sep 18 20:35:47 2019     4285 root_Fortinet_Factory.cer
-rw-------    1 0    0   Wed Sep 18 20:35:47 2019     1679 KEY-FILE
[...]

FGT50E00000000  #
FGT50E00000000  # fnsysctl ls -la /etc/cert/ca
drwxr-xr-x    2 0    0   Wed Dec 25 21:41:28 2019        0 .
drwxr-xr-x    6 0    0   Wed Sep 18 20:39:27 2019        0 ..
-rw-------    1 0    0   Wed Sep 18 20:35:55 2019      119 ca_bundle_ver
-rw-------    1 0    0   Tue Jan 14 20:06:15 2020 1972 root_AC_RAIZ_FNMT-RCM.cer
-rw-------    1 0    0   Tue Jan 14 20:06:15 2020 2772 root_ACCVRAIZ1.cer
-rw-------    1 0    0   Wed Sep 18 20:35:55 2019     2041 root_ACEDICOM_Root.cer
-rw-------    1 0    0   Tue Jan 14 20:06:15 2020 2049 root_Actalis_Authentication_Root_CA.cer
-rw-------    1 0    0   Tue Jan 14 20:06:14 2020 1521 root_AddTrust_External_Root.cer
[...]
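
A saved copy of such a directory listing can be turned into structured data, for example to sort the certificate files by modification date. The sketch below assumes the 11-column layout shown above and an English locale for the weekday and month names. The function name is hypothetical, and collecting the listing from the device is left out.

```python
from datetime import datetime

def parse_cert_listing(listing: str) -> list[tuple[str, int, datetime]]:
    """Return (file name, size in bytes, modification time) for every .cer entry."""
    entries = []
    for line in listing.splitlines():
        fields = line.split()
        # Expected: mode, links, uid, gid, weekday, month, day, time, year, size, name
        if len(fields) != 11 or not fields[10].endswith(".cer"):
            continue
        modified = datetime.strptime(" ".join(fields[4:9]), "%a %b %d %H:%M:%S %Y")
        entries.append((fields[10], int(fields[9]), modified))
    return entries

# Excerpt of the listing shown above.
sample = """\
drwxr-xr-x    2 0    0   Wed Dec 25 21:41:28 2019        0 .
-rw-------    1 0    0   Wed Sep 18 20:35:55 2019      119 ca_bundle_ver
-rw-------    1 0    0   Tue Jan 14 20:06:15 2020 2772 root_ACCVRAIZ1.cer
-rw-------    1 0    0   Wed Sep 18 20:35:55 2019     2041 root_ACEDICOM_Root.cer
"""

for name, size, modified in sorted(parse_cert_listing(sample), key=lambda e: e[2]):
    print(f"{modified:%Y-%m-%d} {size:>6} {name}")
```
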

Cribl - Change values to lowerCase

Some logs (e.g. from Microsoft Azure) are not always fully normalized to lowercase characters. You can use Cribl to adjust those values by...