Geographically redundant datacenters - their performance issues and design solutions

Performance problem - Distance and Latency

More and more companies face the challenging requirement to provide highly available services that survive regional disasters. In other words: their datacenters must be located in different geographical regions. High availability solutions, storage systems, and systems that perform many sequential or write operations often require very low latency in order to run stably, especially under peak or high load.

I recently saw an application that had performance issues with some daily jobs as soon as the application server was running in a different datacenter than the database servers. The two datacenters were connected by dedicated 2x 40 Gbit/s lines, about 5 kilometers apart, with 0.2 ms latency between the application server and the database servers. As soon as the application server was moved into the same datacenter as the database, these jobs took only about 1 hour instead of 8 hours.
👉 This 0.2 ms of latency added up and slowed the application down drastically
👉 Latency is critical when it comes to performance (see also the bandwidth-delay product and "long fat networks" (LFNs) in RFC 1072)
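
A back-of-the-envelope sketch of how such a small delay can add up. It assumes the jobs issue their database calls strictly one after another and pay the 0.2 ms once per call. The number of calls is not known from the case above; it is derived here purely for illustration, and the function names are made up for this example.

```python
# Illustrative model only: a job that issues its database calls strictly
# one after another, paying the added network latency once per call.

def extra_runtime_hours(sequential_calls: int, added_latency_ms: float) -> float:
    """Extra wall-clock time caused by the added latency alone, in hours."""
    return sequential_calls * added_latency_ms / 1000 / 3600


def calls_needed(extra_hours: float, added_latency_ms: float) -> int:
    """Number of sequential calls that would explain a given slowdown."""
    return round(extra_hours * 3600 * 1000 / added_latency_ms)


# 8 hours instead of 1 hour means 7 extra hours at 0.2 ms per call.
calls = calls_needed(7, 0.2)
print(f"{calls:,} sequential calls would add {extra_runtime_hours(calls, 0.2):.1f} hours")
```

Under these assumptions, on the order of a hundred million sequential calls are enough to turn a 1-hour job into an 8-hour job, which is not unusual for batch jobs that process rows one at a time.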

But as soon as you are required to separate your datacenters by hundreds of kilometers in order to be geographically redundant, your latency will grow, because you can't change the speed of light.

👉 If the datacenters in the example were separated not by 5 kilometers but by 200 kilometers, latency would grow from 0.2 ms to 3-5 ms (or more), which would be the death of the application, because those jobs would take far too long.
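
The physical lower bound behind this can be estimated with a few lines of Python. This is a rough sketch, assuming light travels through glass fibre at about two thirds of its vacuum speed and that the fibre runs in a straight line between the sites. Real routes are longer, and switches, routers and the hosts themselves add delay, which is why measured values are higher.

```python
# Lower bound for the round-trip time over fibre, from propagation delay alone.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBRE_VELOCITY_FACTOR = 0.67  # assumption: roughly 2/3 of c in glass fibre


def min_round_trip_ms(fibre_km: float) -> float:
    """Best-case round-trip time in milliseconds for a given fibre length."""
    one_way_s = fibre_km / (SPEED_OF_LIGHT_KM_PER_S * FIBRE_VELOCITY_FACTOR)
    return 2 * one_way_s * 1000


for km in (5, 200, 400):
    print(f"{km:>4} km: at least {min_round_trip_ms(km):.2f} ms round trip")
```

For 200 kilometers this gives roughly 2 ms before any equipment or protocol overhead is counted, so no amount of money spent on hardware brings the latency below that.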

The result: you can't (and shouldn't) stretch applications, high availability clusters, SANs, etc. over long pipes, even if they are fat. Even if you have many directly connected 100 Gbit/s dark fibre links between your datacenters, the higher the latency, the worse your performance. (Of course bandwidth matters too, as do other factors.)
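
Why a fat pipe does not help can be sketched with the bandwidth-delay product. A single TCP connection can have at most one window of data in flight per round trip, so its throughput is capped at window size divided by round-trip time, whatever the link speed. The 64 KiB window below is an assumption (the classic TCP maximum without window scaling); modern stacks negotiate larger windows, but the relationship stays the same.

```python
# Throughput ceiling of one TCP connection: one window per round trip.

def max_throughput_mbit_s(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound in Mbit/s for a single connection with a fixed window."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1_000_000


def bdp_bytes(bandwidth_gbit_s: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_gbit_s * 1e9 / 8 * (rtt_ms / 1000)


WINDOW = 64 * 1024  # assumption: 64 KiB window, no window scaling

for rtt in (0.2, 5.0):
    print(f"RTT {rtt} ms: at most {max_throughput_mbit_s(WINDOW, rtt):.0f} Mbit/s per connection")

print(f"To fill 100 Gbit/s at 5 ms RTT: {bdp_bytes(100, 5.0) / 1e6:.1f} MB in flight")
```

With these assumed values, the same connection drops from a few Gbit/s at 0.2 ms to roughly 100 Mbit/s at 5 ms, on a link that could carry a thousand times more.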

How do others do it?

Microsoft has published some documents that discuss possible solutions:
https://docs.microsoft.com/en-us/sharepoint/administration/plan-for-disaster-recovery
  • Cold standby. A secondary data center that can provide availability within hours or days.
  • Warm standby. A secondary data center that can provide availability within minutes or hours.
  • Hot standby. A secondary data center that can provide availability within seconds or minutes.
[...]
Important: Available network bandwidth and latency are major considerations when you are using a failover approach for disaster recovery. We recommend that you consult with your SAN vendor to determine whether you can use SAN replication for SQL databases or another supported mechanism to provide the hot standby level of availability across data centers.

There is also a video by the great Mark Russinovich (Microsoft Azure CTO): https://www.youtube.com/watch?v=X-0V6bYfTpA


In this video, Microsoft talks about its Azure datacenter requirements and mentions that within a region the network latency perimeter must stay under 2 ms. Mark Russinovich also mentions that the datacenters are therefore within a 100 kilometer range of each other.

There are other useful documents, for example on database redundancy. The one linked here covers business continuity planning, recovery point objective (RPO) and estimated recovery time (ERT), as well as how to monitor for the failure of a site and how ERT is affected not only by the cluster switch, but also by failure detection time plus DNS TTL.
https://docs.microsoft.com/en-us/azure/sql-database/sql-database-designing-cloud-solutions-for-disaster-recovery
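
The ERT composition described there can be written down as simple arithmetic. The sketch below only adds up the three components named above; all timing values are hypothetical placeholders, not figures from the Microsoft document.

```python
# ERT as the sum of the components named above. All values are hypothetical.

def estimated_recovery_time_s(detection_s: float, failover_s: float, dns_ttl_s: float) -> float:
    """Worst case: clients keep the old DNS record for a full TTL after the switch."""
    return detection_s + failover_s + dns_ttl_s


# Hypothetical example: 3 failed health checks at 30 s intervals,
# a 60 s cluster switch, and a DNS TTL of 300 s.
detection = 3 * 30
ert = estimated_recovery_time_s(detection, failover_s=60, dns_ttl_s=300)
print(f"Estimated recovery time: {ert / 60:.1f} minutes")
```

In this made-up case the cluster switch itself is the smallest part of the outage; detection and the DNS TTL dominate.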


There is also a great article on the Percona blog that discusses this issue: https://www.percona.com/blog/2018/11/15/how-not-to-do-mysql-high-availability-geographic-node-distribution-with-galera-based-replication-misuse/
We had two datacenters.
  • The connection between the two was with fiber
  • Distance Km ~400, but now we MUST consider the distance to go and come back. This because in case of real communication, we have not only the send, but also the receive packages.
  • Theoretical time at light-speed =2.66ms (2 ways)
  • Ping = 3.10ms (signal traveling at ~80% of the light speed) as if the signal had traveled ~930Km (full roundtrip 800 Km)
  • TCP/IP best at 48K = 4.27ms (~62% light speed) as if the signal had traveled ~1,281km
  • TCP/IP best at 512K =37.25ms (~2.6% light speed) as if the signal had traveled ~11,175km
 Given the above, we have from ~20%-~40% to ~97% loss from the theoretical transmission rate. Keep in mind that when moving from a simple signal to a more heavy and concurrent transmission, we also have to deal with the bandwidth limitation. This adds additional cost. All in only 400Km distance.
This is not all. Within the 400km we were also dealing with data congestions, and in some cases the tests failed to provide the level of accuracy we required due to transmission failures and too many packages retry.

[...]
What Is the Right Thing To Do? 

The right solution is easier than the wrong one, and there are already tools in place to make it work efficiently. Say you need to define your HA solution between the East and West Coast, or between Paris and Frankfurt. First of all, identify the real capacity of your network in each DC. Then build a tightly coupled database cluster in location A and another tightly coupled database cluster in the other location B. Then link them using ASYNCHRONOUS replication.
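
A simplified model shows why asynchronous replication is the recommended link between distant sites. It assumes that a synchronous commit has to wait for one network round trip to the remote site before it returns, while an asynchronous commit returns after the local write and ships the change later. The timing values are illustrative, and the model ignores concurrency, so it describes a single session only.

```python
# Simplified single-session model: commit rate with and without waiting
# for the remote site. Values are illustrative assumptions.

def commits_per_second(local_commit_ms: float, rtt_ms: float, synchronous: bool) -> float:
    """Commits one session can complete per second."""
    per_commit_ms = local_commit_ms + (rtt_ms if synchronous else 0.0)
    return 1000 / per_commit_ms


LOCAL_COMMIT_MS = 0.5  # assumption: time for a local commit
RTT_MS = 4.0           # assumption: round trip between the two sites

print(f"synchronous:  {commits_per_second(LOCAL_COMMIT_MS, RTT_MS, True):.0f} commits/s")
print(f"asynchronous: {commits_per_second(LOCAL_COMMIT_MS, RTT_MS, False):.0f} commits/s")
```

The price of the asynchronous link is that the remote site can lag behind, so the most recent transactions may be lost in a failover. That trade-off is exactly what the RPO is meant to capture.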



Conclusion

👉 Unfortunately there is no easy solution for this. Each application, system and solution has its own requirements and possible solutions. But: don't underestimate the distance between your datacenters and the latency that comes with it, even if a few milliseconds don't sound like much.

