ProxySQL, MySQL Group Replication, and Latency

栏目: IT技术 · 发布时间: 5年前

内容简介：Similar in functionality toA high performance, high availability, protocol aware proxy for MySQL. It allows the shaping of database traffic by delaying, caching or rewriting queries on the fly. ProxySQL can also be used to create an environment where failo

ProxySQL, MySQL Group Replication, and Latency While we’ve had MySQL Group Replication support in ProxySQL since version 1.3 (native as of v1.4), development has continued in subsequent versions. I’d like to describe a scenario of how latency can affect ProxySQL in a MySQL Group Replication environment, and outline a few new features that might help mitigate those issues. Before we dive into the specifics of the discussion, however, let’s take a quick overview of ProxySQL and Group Replication for those who may not be familiar.

MySQL Group Replication

Similar in functionality to Percona XtraDB Cluster or Galera, MySQL Group Replication is the only synchronous native HA solution for MySQL * . With built-in automatic distributed recovery, conflict detection, and group membership, MySQL GR provides a completely native HA solution for MySQL environments.

ProxySQL

A high performance, high availability, protocol aware proxy for MySQL. It allows the shaping of database traffic by delaying, caching or rewriting queries on the fly. ProxySQL can also be used to create an environment where failovers will not affect your application, automatically removing (and adding back) database nodes from a cluster based on definable thresholds.

*There is technically one other native HA solution from Oracle – MySQL NDB Cluster. However, it is outside the scope of this article and not for most general use cases.

Test Case

I recently had an interesting case with a client who was having severe issues with latency due to network/storage stalls at the hypervisor level. The environment is fairly standard, with a single MySQL 8.x GR writer node, and two MySQL 8.x GR passive nodes. In front of the database cluster sits ProxySQL, routing traffic to the active writer and handling failover duties should one of the database nodes become unavailable. The latency always occurred in short spikes, ramping up and then falling off quickly (within seconds).

The latency and I/O stalls from the network/hypervisor were throwing ProxySQL a curveball in determining if a node was actually healthy or not, and the client was seeing frequent failovers of the active writer node – often multiple times per day. To dive a bit deeper into this, let’s examine how ProxySQL determines a node’s health at a high level.

PING
- mysql-monitor_ping_timeout
  - Issued on open connection.
SELECT
- mysql-monitor_groupreplication_healthcheck_timeout
  - Gets the number of transactions a node is behind and identifies which node is the writer.
CONNECT
- mysql-monitor_ping_timeout
  - Will try to open new connections to the host and measure timing.

In a perfect environment, these checks work as intended, and if a node is not reachable, or has fallen too far behind, ProxySQL is able to determine that and remove the node from the cluster. This is known as a hard_offline in ProxySQL, and means the node is removed from the routing table and all traffic to that node stops. If that node is the writer node, ProxySQL will then tee up one of the passive nodes as the active writer, and the failover is complete.

Many of the ProxySQL health checks have multiple variables to control the timeout behavior. For instance, mysql-monitor_ping_timeout sets the maximum timeout for a MySQL node to be unresponsive to a ping, and mysql-monitor_ping_max_failures set up how many times a MySQL node would have to fail a ping check before ProxySQL decides to mark it hard_offline and pull the node out of the cluster.

This wasn’t the case for the Group Replication specific ping checks, however. Prior to version 2.0.7, the options were more limited for Group Replication checks. Note we did not have the same max_failures for Group Replication that we had for standalone MySQL, and we only had the timeout check:

mysql-monitor_groupreplication_healthcheck_timeout

Added in version 2.0.7 was a new variable, giving us the ability to retry multiple times before marking a GR node hard_offline:

mysql-monitor_groupreplication_healthcheck_max_timeout_count

By setting this variable it is possible to have the group replication health check fail a configurable number of times before pulling a node out of the cluster. While this is certainly more of a Band-Aid than an actual resolution, it would allow keeping a ProxySQL + GR environment up and running while work is being done to find the root cause of latency and prevent unnecessary flapping between active and passive nodes during short latency spikes and I/O stalls.

Another similar option is currently being implemented in ProxySQL 2.0.9 for the transactions_behind check. See below:

mysql-monitor_groupreplication_max_transactions_behind

Currently, if group replication max_transactions_behind exceeds the threshold once, the node is evicted from the cluster. The upcoming 2.0.9 release features another additional variable which will define a count for such checks so that max_transactions_behind would have to fail more than once (x number of times) before eviction.

mysql-monitor_groupreplication_max_transactions_behind_count

In Summary

To be clear, the above settings will not fix any latency issues present in your environment. However, since latency can often be a hardware or network issue, and in many cases can take time to track down, these options may stabilize the environment by allowing you to relax ProxySQL’s health checks while the root cause investigation for the latency is underway.

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持码农网