A Study of Master-Slave Failover in Redis Cluster

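1. Preface

Official Redis reference: https://redis.io/topics/cluster-spec . Note that starting with Redis 5.0, slave has been renamed to replica, and the related configuration options, parts of the documentation, and some variables have been renamed as well.

Master-slave failover in Redis Cluster is decided by an election that requires a majority. Only masters may vote, so a failover can take place only while a majority of masters are alive. The election itself is started by a slave.

Redis uses a concept similar to the term in the Raft algorithm. In Redis it is called an epoch. An epoch is an unsigned 64-bit integer, and a node's epoch starts at 0.

If a node receives an epoch greater than its own, it updates its own epoch to the received value (the network is assumed to be trusted, so the Byzantine generals problem does not arise).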
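The epoch adoption rule above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name `observe_epoch` is hypothetical and is not a Redis API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Redis code: apply the "adopt the larger epoch"
 * rule when a message carrying received_epoch arrives. Epochs never go
 * backwards, so a smaller or equal value leaves the local epoch unchanged. */
void observe_epoch(uint64_t *current_epoch, uint64_t received_epoch) {
    assert(current_epoch != NULL);
    if (received_epoch > *current_epoch) {
        *current_epoch = received_epoch;
    }
}
```

A node at epoch 27 that sees epoch 30 moves to 30, while a node at epoch 30 that sees epoch 27 stays at 30.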

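Every master broadcasts its epoch and the bitmap of the slots it serves in its ping and pong messages. When a slave starts an election, it creates a new epoch by incrementing the current one. The epoch value is persisted in the file nodes.conf, for example (here the latest epoch is 27 and the most recent vote was cast in epoch 27):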

vars currentEpoch 27 lastVoteEpoch 27

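2. The slave starts an election

A slave starts an election only when its master is in the fail state. It does not start the election the moment the master becomes fail, though. It first waits for the delay below, which includes a random component so that several slaves do not start elections at the same time (the election therefore starts at least 0.5 seconds later):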

500 milliseconds + random delay between 0 and 500 milliseconds + SLAVE_RANK * 1000 milliseconds
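The delay formula above can be written as a small C function. This is a minimal sketch, not Redis code: the name `election_delay_ms` and its parameters are hypothetical, and the random component is passed in explicitly so that the result is reproducible.

```c
#include <assert.h>

/* Hypothetical helper, not part of Redis: election start delay in ms.
 * random_part_ms stands in for the random component of the formula
 * (0..499 here); slave_rank is the SLAVE_RANK of the slave. */
long long election_delay_ms(int random_part_ms, int slave_rank) {
    assert(random_part_ms >= 0 && random_part_ms < 500);
    assert(slave_rank >= 0);
    return 500LL + random_part_ms + (long long)slave_rank * 1000;
}
```

With a random part of 79 ms and rank 0 the function returns 579, which matches the 579 ms delay seen in the slave log in section 6.5. A slave with rank 2 waits at least 2500 ms.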

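A slave may start an election when all of the following hold:

1) its master is in the fail state (not the pfail state);

2) its master was serving at least one slot;

3) the replication link between the slave and the master has been down for no longer than a given limit. The limit is configurable and exists to make sure the slave's data is reasonably complete. Operationally, this means a slave must not be left unavailable for long: monitoring should catch a faulty slave so that it can be recovered promptly.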
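The three conditions can be combined into one predicate. The sketch below is an assumption-laden illustration: the struct `slave_view`, its field names, and `may_start_election` are hypothetical. The data-age bound follows the check in the code excerpt in section 8 and ignores any other adjustments Redis applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of what a slave knows; all field names are illustrative. */
typedef struct {
    bool master_is_fail;          /* master flagged fail (not just pfail) */
    int master_slot_count;        /* number of slots served by the master */
    long long data_age_ms;        /* time since the replication link was last alive */
    long long node_timeout_ms;    /* cluster-node-timeout */
    long long repl_ping_period_s; /* repl-ping-slave-period, in seconds */
    int validity_factor;          /* cluster-slave-validity-factor, 0 disables the check */
} slave_view;

bool may_start_election(const slave_view *v) {
    assert(v != NULL);
    if (!v->master_is_fail) return false;        /* condition 1 */
    if (v->master_slot_count <= 0) return false; /* condition 2 */
    if (v->validity_factor != 0) {               /* condition 3 */
        long long max_age = v->repl_ping_period_s * 1000 +
                            v->node_timeout_ms * v->validity_factor;
        if (v->data_age_ms > max_age) return false;
    }
    return true;
}
```

With the parameters of section 6.1 (node timeout 30000, ping period 1, validity factor 1), the largest accepted data age under this sketch is 31000 ms.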

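Before starting the election, the slave increments its own epoch (currentEpoch) and then asks the masters to vote for it by broadcasting a FAILOVER_AUTH_REQUEST packet to every master in the cluster.

After requesting votes, the slave waits up to twice NODE_TIMEOUT for the replies, and always at least 2 seconds whatever the value of NODE_TIMEOUT.

A master that grants the vote replies to the slave with FAILOVER_AUTH_ACK, and for NODE_TIMEOUT*2 it will not vote for any other slave of the same master.

If the slave receives a FAILOVER_AUTH_ACK whose epoch is smaller than its own, it discards the reply. Once the slave has received FAILOVER_AUTH_ACK from a majority of masters, it declares itself the winner of the election.

If the slave has not won within twice NODE_TIMEOUT (at least 2 seconds), it gives up this election and starts a new one after four times NODE_TIMEOUT (at least 4 seconds).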
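The waiting times and the majority rule described above reduce to simple arithmetic. The following is a sketch only. The function names are hypothetical, and `master_count` is assumed to be the number of masters entitled to vote.

```c
#include <assert.h>
#include <stdbool.h>

/* Time to wait for votes: twice NODE_TIMEOUT, but never under 2 seconds. */
long long auth_timeout_ms(long long node_timeout_ms) {
    assert(node_timeout_ms >= 0);
    long long t = node_timeout_ms * 2;
    return t < 2000 ? 2000 : t;
}

/* Time before a new attempt: four times NODE_TIMEOUT, at least 4 seconds. */
long long auth_retry_ms(long long node_timeout_ms) {
    return auth_timeout_ms(node_timeout_ms) * 2;
}

/* Majority: acknowledgements from more than half of the voting masters. */
bool has_majority(int acks, int master_count) {
    assert(master_count > 0 && acks >= 0);
    return acks >= master_count / 2 + 1;
}
```

For a NODE_TIMEOUT of 30000 ms the slave waits up to 60000 ms for votes and retries after 120000 ms. In a cluster with three voting masters, two acknowledgements are enough to win.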

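The election is delayed by at least 0.5 seconds so that the master's fail state has time to spread through the whole cluster. Otherwise only a few masters might know about it, and a master votes only for slaves whose master is in the fail state: if, from a master's point of view, a slave's master is not fail, that master will not vote for the slave. Redis propagates the fail state with a gossip protocol (also known as a rumor protocol). The random delay added on top of the fixed delay keeps multiple slaves from starting elections at the same time.

A slave's SLAVE_RANK reflects how much of the master's replication stream it has processed: the slave with the most recent replication offset has SLAVE_RANK 0, the second has 1, and so on. This lets the slave with the most complete data start its election first. If a better-ranked slave (one with a lower SLAVE_RANK value) fails to get elected, the other slaves start their own elections soon after (at least 4 seconds later).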
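One way to compute such a rank is to count the sibling slaves that are ahead in replication. The sketch below assumes that definition. The function `slave_rank` and its parameters are hypothetical, and the offsets in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rank = number of sibling slaves (slaves of the same
 * master) whose replication offset is strictly greater than ours.
 * Rank 0 means no sibling is ahead of us. */
int slave_rank(long long my_offset, const long long *sibling_offsets, size_t n) {
    assert(sibling_offsets != NULL || n == 0);
    int rank = 0;
    for (size_t i = 0; i < n; i++) {
        if (sibling_offsets[i] > my_offset) rank++;
    }
    return rank;
}
```

With sibling offsets 300 and 100, a slave at offset 200 gets rank 1 and therefore waits one extra second before starting its election, while a slave at offset 300 or above gets rank 0.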

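After a slave wins the election, it broadcasts pong to all nodes in the cluster so that the reconfiguration (visible as an update to nodes.conf) completes as quickly as possible. Nodes that cannot be reached at that moment will also be reconfigured eventually.

Other nodes will notice two masters claiming the same slots. In that case the master with the larger epoch wins.

After becoming master, the slave does not start serving immediately; there is a short delay first.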

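3. The master responds to the election

When a master receives a vote request FAILOVER_AUTH_REQUEST from a slave, it grants the vote only if the following conditions are met:

1) it votes only once per epoch;

2) it rejects every vote request that carries a smaller epoch;

3) it does not vote for an epoch smaller than lastVoteEpoch;

4) it votes only for a slave whose master is in the fail state;

5) if the currentEpoch in the slave's request is smaller than the master's currentEpoch, the master ignores the request. Without this rule, the following problem could occur:

① the master's currentEpoch is 5 and its lastVoteEpoch is 1 (this can happen after failed elections: currentEpoch has increased, but because the elections failed, lastVoteEpoch has not changed);

② the slave's currentEpoch is 3;

③ the slave increments its epoch and starts an election with epoch 4. The master replies with epoch 5, but the reply is delayed;

④ the slave starts another election, this time with epoch 5 (the epoch is incremented for every election). The delayed reply now arrives, and the slave accepts it as valid.

After voting, the master updates its local lastVoteEpoch with the epoch from the request and persists it to nodes.conf. Masters take no part in choosing the best slave. Because the best slave has the best SLAVE_RANK, it can simply start its election sooner than the others.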
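The voting rules above can be condensed into one decision function. This is a simplified sketch under stated assumptions, not the Redis implementation: the types `master_state` and `vote_request` and the function `try_grant_vote` are hypothetical, and the rule that a master does not vote again for a slave of the same master within NODE_TIMEOUT*2 is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t current_epoch;
    uint64_t last_vote_epoch;
} master_state;

typedef struct {
    uint64_t request_epoch;  /* currentEpoch carried by the request */
    bool sender_master_fail; /* is the requesting slave's master in the fail state? */
} vote_request;

/* Returns true and records the vote if the request passes every check. */
bool try_grant_vote(master_state *m, const vote_request *req) {
    assert(m != NULL && req != NULL);
    /* A larger epoch seen in any message is adopted first (section 1). */
    if (req->request_epoch > m->current_epoch)
        m->current_epoch = req->request_epoch;
    /* Rules 2 and 5: ignore requests that carry an older epoch. */
    if (req->request_epoch < m->current_epoch) return false;
    /* Rules 1 and 3: one vote per epoch, never at or below lastVoteEpoch. */
    if (req->request_epoch <= m->last_vote_epoch) return false;
    /* Rule 4: the slave's master must be in the fail state. */
    if (!req->sender_master_fail) return false;
    m->last_vote_epoch = req->request_epoch;
    return true;
}
```

Starting from currentEpoch 5 and lastVoteEpoch 1, a request with epoch 4 is ignored, a request with epoch 6 is granted, and a second request with epoch 6 is refused because the vote for that epoch has already been cast.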

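4. Election example

Suppose a master has three slaves A, B and C, and the master becomes unreachable:

1) slave A wins the election and becomes master;

2) a network partition makes A unavailable;

3) slave B wins the election;

4) a network partition makes B unavailable;

5) the partition heals and A is available again.

Now B is down and A is back. At the same moment slave C starts an election, trying to replace B as master. Because C's master is unavailable, C can be elected master, and it increments its configEpoch. A cannot become master, because C is already master and C's epoch is larger.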

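5. How hash slot information propagates

Hash slot information propagates in two ways:

1) Heartbeat messages. Whenever a node sends a ping or pong message, it includes the hash slots it serves (or, for a slave, the slots its master serves);

2) UPDATE messages. Heartbeat packets also carry epoch information, so when the receiver finds that the information in a heartbeat is stale, it replies with newer information, forcing the sender to update its hash slot configuration.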
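How a receiver might treat a slot claim carried by a heartbeat can be sketched as follows. This is an illustration under assumptions: the type `slot_owner`, the integer node identifiers, and `handle_slot_claim` are hypothetical, and a real node tracks this for every slot in the bitmap, not for one slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of who we believe owns one slot. */
typedef struct {
    int owner_id;          /* illustrative node identifier */
    uint64_t config_epoch; /* epoch advertised by that owner */
} slot_owner;

/* A heartbeat from sender_id claims the slot with sender_epoch.
 * Returns true when the claim is stale, i.e. an UPDATE message should be
 * sent back. Otherwise the slot is rebound if the sender's epoch is newer. */
bool handle_slot_claim(slot_owner *known, int sender_id, uint64_t sender_epoch) {
    assert(known != NULL);
    if (known->owner_id == sender_id) {
        if (sender_epoch > known->config_epoch)
            known->config_epoch = sender_epoch;
        return false;
    }
    if (sender_epoch < known->config_epoch) return true; /* sender is stale */
    if (sender_epoch > known->config_epoch) {
        known->owner_id = sender_id;
        known->config_epoch = sender_epoch;
    }
    return false;
}
```

If the slot is known to belong to node 1 at epoch 30, a claim from node 2 at epoch 29 is stale and triggers an update, while a claim from node 3 at epoch 32 takes over the slot.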

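6. Failover record 1

The test cluster runs on a single physical machine, with cluster-node-timeout larger than repl-timeout.

6.1. Parameters

cluster-slave-validity-factor: 1

cluster-node-timeout: 30000

repl-ping-slave-period: 1

repl-timeout: 10

6.2. Timeline

The failover takes place within about 1 second of the master being marked FAIL.

Master A marks fail: 20:12:55.467

Master B marks fail: 20:12:55.467

Master A votes: 20:12:56.164

Master B votes: 20:12:56.164

Slave starts the election: 20:12:56.160

Slave schedules the election: 20:12:55.558 (delayed by 579 milliseconds)

Slave detects the heartbeat timeout with its master: 20:12:32.810 (the failover took place about 23 seconds after this)

Slave receives the message from another master that its master is fail: 20:12:55.467

Last time the service was normal before the failover: 20:12:22/279275 (the service failure began at about second …)

Service back to normal after the failover: 20:12:59/278149

Service unavailable for: about 37 seconds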
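The intervals in this timeline are plain timestamp differences. As a small worked example, the hypothetical helper below (not taken from the original article) converts a time of day to milliseconds so that the log timestamps can be subtracted.

```c
#include <assert.h>

/* Converts a wall-clock time within one day to milliseconds since midnight. */
long long to_ms(int h, int m, int s, int ms) {
    assert(h >= 0 && h < 24 && m >= 0 && m < 60);
    assert(s >= 0 && s < 60 && ms >= 0 && ms < 1000);
    return ((h * 60LL + m) * 60 + s) * 1000 + ms;
}
```

For instance, `to_ms(20, 12, 59, 0) - to_ms(20, 12, 22, 0)` is 37000, the roughly 37 seconds of unavailability, and `to_ms(20, 12, 56, 160) - to_ms(20, 12, 55, 558)` is 602, slightly more than the scheduled 579 ms election delay.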

6.3. Log of another master

This master's ID is c67dc9e02e25f2e6321df8ac2eb4d99789917783.

30613:M 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // another master reports that 44eb43e50c101c5f44f48295c42dda878b6cb3e9 has failed

30613:M 04 Jan 2019 20:12:55.467 # Cluster state changed: fail

30613:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30 // votes in the election

30613:M 04 Jan 2019 20:12:56.204 # Cluster state changed: ok

30613:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

6.4. Log of another master

This master's ID is bfad383775421b1090eaa7e0b2dcfb3b38455079.

30614:M 04 Jan 2019 20:12:55.467 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached). // marks 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failed

30614:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30 // votes in the election

30614:M 04 Jan 2019 20:12:56.709 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

6.5. Slave log

The ID of the slave's master is 44eb43e50c101c5f44f48295c42dda878b6cb3e9, and the slave's own ID is 0ae8b5400d566907a3d8b425d983ac3b7cbd8412.

30651:S 04 Jan 2019 20:12:32.810 # MASTER timeout: no data nor PING received... // master timeout detected; it is noticed 10 seconds after the master failed because repl-timeout is 10

30651:S 04 Jan 2019 20:12:32.810 # Connection with master lost.

30651:S 04 Jan 2019 20:12:32.810 * Caching the disconnected master state.

30651:S 04 Jan 2019 20:12:32.810 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:32.810 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:32.810 * Non blocking connect for SYNC fired the event.

30651:S 04 Jan 2019 20:12:43.834 # Timeout connecting to the MASTER...

30651:S 04 Jan 2019 20:12:43.834 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:43.834 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:43.834 * Non blocking connect for SYNC fired the event.

30651:S 04 Jan 2019 20:12:54.856 # Timeout connecting to the MASTER...

30651:S 04 Jan 2019 20:12:54.856 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:54.856 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:54.856 * Non blocking connect for SYNC fired the event.

30651:S 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // receives the FAIL message about its own master from another master

30651:S 04 Jan 2019 20:12:55.467 # Cluster state changed: fail

30651:S 04 Jan 2019 20:12:55.558 # Start of election delayed for 579 milliseconds (rank #0, offset 227360). // schedules the election with a 579 ms delay: 500 ms fixed delay plus 79 ms random delay; the rank is 0, so the rank delay is 0 ms

30651:S 04 Jan 2019 20:12:56.160 # Starting a failover election for epoch 30. // starts the election

30651:S 04 Jan 2019 20:12:56.180 # Failover election won: I'm the new master. // wins the election

30651:S 04 Jan 2019 20:12:56.180 # configEpoch set to 30 after successful failover

30651:M 04 Jan 2019 20:12:56.180 # Setting secondary replication ID to 154a9c2319403d610808477dcda3d4bede0f374c, valid up to offset: 227361. New replication ID is 927fb64a420236ee46d39389611ab2d8f6530b6a

30651:M 04 Jan 2019 20:12:56.181 * Discarding previously cached master state.

30651:M 04 Jan 2019 20:12:56.181 # Cluster state changed: ok

30651:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // ignores the message from 1.9.16.9:4077, which is not a member of the cluster

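7. Failover record 2

The test cluster runs on a single physical machine, with cluster-node-timeout smaller than repl-timeout.

7.1. Parameters

cluster-slave-validity-factor: 1

cluster-node-timeout: 10000

repl-ping-slave-period: 1

repl-timeout: 30

7.2. Timeline

The failover takes place within about 1 second of the master being marked FAIL.

Master A marks fail: 20:37:10.398

Master B marks fail: 20:37:10.398

Master A votes: 20:37:11.084

Master B votes: 20:37:11.085

Slave starts the election: 20:37:11.077

Slave schedules the election: 20:37:10.475 (delayed by 539 milliseconds)

Slave detects the heartbeat timeout with its master: did not happen, because the slave had already become master before the timeout

Slave receives the message from another master that its master is fail: 20:37:10.398

Last time the service was normal before the failover: 20:36:55/266889 (the service failure began at about second 56)

Service back to normal after the failover: 20:37:12/265802

Service unavailable for: about 17 seconds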

7.3. Log of another master

This master's ID is c67dc9e02e25f2e6321df8ac2eb4d99789917783.

30613:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).

30613:M 04 Jan 2019 20:37:10.398 # Cluster state changed: fail

30613:M 04 Jan 2019 20:37:11.084 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32

30613:M 04 Jan 2019 20:37:11.124 # Cluster state changed: ok

30613:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

7.4. Log of another master

This master's ID is bfad383775421b1090eaa7e0b2dcfb3b38455079.

30614:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).

30614:M 04 Jan 2019 20:37:11.085 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32

30614:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

7.5. Slave log

The ID of the slave's master is 44eb43e50c101c5f44f48295c42dda878b6cb3e9, and the slave's own ID is 0ae8b5400d566907a3d8b425d983ac3b7cbd8412.

30651:S 04 Jan 2019 20:37:10.398 * FAIL message received from c67dc9e02e25f2e6321df8ac2eb4d99789917783 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

30651:S 04 Jan 2019 20:37:10.398 # Cluster state changed: fail

30651:S 04 Jan 2019 20:37:10.475 # Start of election delayed for 539 milliseconds (rank #0, offset 228620).

30651:S 04 Jan 2019 20:37:11.077 # Starting a failover election for epoch 32.

30651:S 04 Jan 2019 20:37:11.100 # Failover election won: I'm the new master.

30651:S 04 Jan 2019 20:37:11.100 # configEpoch set to 32 after successful failover

30651:M 04 Jan 2019 20:37:11.100 # Setting secondary replication ID to 0cf19d01597610c7933b7ed67c999a631655eafc, valid up to offset: 228621. New replication ID is 53daa7fa265d982aebd3c18c07ed5f178fc3f70b

30651:M 04 Jan 2019 20:37:11.101 # Connection with master lost.

30651:M 04 Jan 2019 20:37:11.101 * Caching the disconnected master state.

30651:M 04 Jan 2019 20:37:11.101 * Discarding previously cached master state.

30651:M 04 Jan 2019 20:37:11.101 # Cluster state changed: ok

30651:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

8. Code for the slave's delayed election start

// Excerpted from Redis-5.0.3

// cluster.c

/* This function is called if we are a slave node and our master serving
 * a non-zero amount of hash slots is in FAIL state.
 *
 * The goal of this function is:
 * 1) To check if we are able to perform a failover, is our data updated?
 * 2) Try to get elected by masters.
 * 3) Perform the failover informing all the other nodes.
 */
void clusterHandleSlaveFailover(void) {
    /* ... (code omitted) ... */

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     *
     * Check bypassed for manual failovers. */
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000) +
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        if (!manual_failover) {
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

    /* If the previous failover attempt timedout and the retry time has
     * elapsed, we can setup a new one. */
    if (auth_age > auth_retry_time) {
        server.cluster->failover_auth_time = mstime() +
            500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
            random() % 500; /* Random delay between 0 and 500 milliseconds. */
        server.cluster->failover_auth_count = 0;
        server.cluster->failover_auth_sent = 0;
        server.cluster->failover_auth_rank = clusterGetSlaveRank();
        /* We add another delay that is proportional to the slave rank.
         * Specifically 1 second * rank. This way slaves that have a probably
         * less updated replication offset, are penalized. */
        server.cluster->failover_auth_time +=
            server.cluster->failover_auth_rank * 1000;
        /* However if this is a manual failover, no delay is needed. */
        if (server.cluster->mf_end) {
            server.cluster->failover_auth_time = mstime();
            server.cluster->failover_auth_rank = 0;
        }
        serverLog(LL_WARNING,
            "Start of election delayed for %lld milliseconds "
            "(rank #%d, offset %lld).",
            server.cluster->failover_auth_time - mstime(),
            server.cluster->failover_auth_rank,
            replicationGetSlaveOffset());
        /* Now that we have a scheduled election, broadcast our offset
         * to all the other slaves so that they'll updated their offsets
         * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
        return;
    }

    /* ... (code omitted) ... */
}
