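Title: A Case-Based Analysis of the MySQL Group Replication Failure Detection Flow
Author: 惊雷无声 Time: 2022-11-7 14:14
Failure Detection is a core functional module of Group Replication. It identifies failed nodes in the cluster promptly and expels them from the cluster. If a failed node is not expelled in time, it degrades cluster performance and also blocks changes to the cluster topology.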
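The rest of this post walks through a concrete case to analyze Group Replication's failure detection flow.
In addition, this post analyzes the following questions.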
2022-07-31T13:03:07.582519-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.30:3306 has become unreachable.'
2022-07-31T13:03:07.690416-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.10:3306 has become unreachable.'
2022-07-31T13:03:07.690492-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.20:3306 has become unreachable.'
2022-07-31T13:03:07.690504-00:00 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
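The MY-011495 error above is logged when a member can no longer reach a majority of the group. As a rough illustration only, the quorum arithmetic for this three-node group can be sketched as follows. This is a simplified model, not the actual XCom logic, and the function name `can_reach_majority` is made up for this example.

```python
def can_reach_majority(group_size: int, unreachable: int) -> bool:
    """Return True if a member still sees a strict majority of the group.

    The member counts itself as reachable, so the reachable set is
    everything in the group except the peers it has lost contact with.
    """
    reachable = group_size - unreachable
    return reachable > group_size // 2


# Three-member group, as in the logs above.
# A member that lost contact with one peer still sees 2 of 3.
print(can_reach_majority(3, 1))
# A member that lost contact with both peers sees only itself (1 of 3).
print(can_reach_majority(3, 2))
```

Under this model, the member reporting a single unreachable peer keeps its majority, while the member reporting two unreachable peers loses it and blocks updates, which matches the two log excerpts.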
mysql> select * from slowtech.t1 where id=1;
+----+------+
| id | c1   |
+----+------+
|  1 | a    |
+----+------+
1 row in set (0.00 sec)

mysql> delete from slowtech.t1 where id=1;
(blocked...)
# iptables -F

# date "+%Y-%m-%d %H:%M:%S"
2022-07-31 13:07:30
First, look at node3's log.
2022-07-31T13:07:30.464179-00:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.10:3306 is reachable again.'
2022-07-31T13:07:30.464226-00:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.20:3306 is reachable again.'
2022-07-31T13:07:30.464239-00:00 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2022-07-31T13:07:37.458761-00:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2022-07-31T13:07:37.459011-00:00 0 [Warning] [MY-011630] [Repl] Plugin group_replication reported: 'Due to a plugin error, some transactions were unable to be certified and will now rollback.'
2022-07-31T13:07:37.459037-00:00 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2022-07-31T13:07:37.459431-00:00 31 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 31'
2022-07-31T13:07:37.459478-00:00 31 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2022-07-31T13:07:37.459811-00:00 33 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'

2022-07-31T13:07:37.465738-00:00 34 [System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'
2022-07-31T13:07:37.496466-00:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2022-07-31T13:07:37.498813-00:00 36 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 351, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:39.653028-00:00 34 [System] [MY-013375] [Repl] Plugin group_replication reported: 'Auto-rejoin procedure attempt 1 of 3 finished. Member was able to join the group.'
2022-07-31T13:07:40.653484-00:00 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2022-07-31T13:07:40.653822-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.244.10:3306, 192.168.244.20:3306, 192.168.244.30:3306 on view 16592724636525403:4.'
2022-07-31T13:07:40.670530-00:00 46 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='192.168.244.20', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:40.682990-00:00 47 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2022-07-31T13:07:40.687566-00:00 47 [System] [MY-010562] [Repl] Slave I/O thread for channel 'group_replication_recovery': connected to master 'repl@192.168.244.20:3306',replication started in log 'FIRST' at position 4
2022-07-31T13:07:40.717851-00:00 46 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.168.244.20', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:40.732297-00:00 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
2022-07-31T13:07:40.732511-00:00 53 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
2022-07-31T13:07:39.555613-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.244.10:3306, 192.168.244.20:3306, 192.168.244.30:3306 on view 16592724636525403:4.'
2022-07-31T13:07:40.732568-00:00 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 192.168.244.30:3306 was declared online within the replication group.'
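To read a timeline off logs like the ones above, it helps to pull out the timestamp and error code of each line and compute the gaps between key events. The following is a minimal sketch of that idea. The names `LOG_RE`, `first_seen`, and `NODE3_LOG` are invented for this example, and the regular expression only assumes the error-log layout visible in this post (timestamp, thread id, level, error code, subsystem, message).

```python
import re
from datetime import datetime

# Matches the error-log layout seen in this post:
# timestamp, thread id, [level], [error code], [subsystem], message.
LOG_RE = re.compile(
    r"^(?P<ts>\S+) (?P<thread>\d+) \[(?P<level>\w+)\] "
    r"\[(?P<code>MY-\d+)\] \[(?P<subsys>\w+)\] (?P<msg>.*)$"
)


def first_seen(lines):
    """Map each error code to the timestamp of its first occurrence."""
    seen = {}
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not in error-log format
        if m.group("code") not in seen:
            seen[m.group("code")] = datetime.fromisoformat(m.group("ts"))
    return seen


# Four lines copied from the node3 log above.
NODE3_LOG = [
    "2022-07-31T13:07:30.464179-00:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.10:3306 is reachable again.'",
    "2022-07-31T13:07:37.458761-00:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'",
    "2022-07-31T13:07:39.653028-00:00 34 [System] [MY-013375] [Repl] Plugin group_replication reported: 'Auto-rejoin procedure attempt 1 of 3 finished. Member was able to join the group.'",
    "2022-07-31T13:07:40.732297-00:00 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'",
]

seen = first_seen(NODE3_LOG)
print("reachable again -> expelled:", seen["MY-011505"] - seen["MY-011494"])
print("expelled -> online again:   ", seen["MY-011490"] - seen["MY-011505"])
```

On these lines, node3 learns of its expulsion roughly 7 seconds after the network is restored, and it is declared ONLINE again a little over 3 seconds after that.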
[Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Messages that are needed to recover node 192.168.244.30:33061 have been evicted from the message cache. Consider resizing the maximum size of the cache by setting group_replication_message_cache_size.'
6. Check the system tables.
Besides the error log, we can also use the system tables to gauge how the XCom Cache is being used.
mysql> select * from performance_schema.memory_summary_global_by_event_name where event_name like "%GCS_XCom::xcom_cache%"\G
*************************** 1. row ***************************
                  EVENT_NAME: memory/group_rpl/GCS_XCom::xcom_cache
                 COUNT_ALLOC: 23678
                  COUNT_FREE: 22754
   SUM_NUMBER_OF_BYTES_ALLOC: 154713397
    SUM_NUMBER_OF_BYTES_FREE: 28441492
              LOW_COUNT_USED: 0
          CURRENT_COUNT_USED: 924
             HIGH_COUNT_USED: 20992
    LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 126271905
   HIGH_NUMBER_OF_BYTES_USED: 146137294
1 row in set (0.00 sec)
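The byte counters in this row are easier to judge when expressed as a fraction of group_replication_message_cache_size. The sketch below parses the vertical output and computes that ratio. The helper names `parse_vertical` and `cache_usage` are invented for this example. The cache size used in this test is not stated in the post, so the 1 GiB figure below is only an assumed value (it is the documented default) and should be replaced with the value actually configured.

```python
def parse_vertical(output):
    """Parse one row of vertical-format mysql client output into a dict."""
    row = {}
    for line in output.splitlines():
        # Skip the row separator and the trailing "1 row in set" line.
        if line.startswith("*") or ":" not in line:
            continue
        # Split on the first colon only; EVENT_NAME values contain "::".
        key, _, value = line.partition(":")
        row[key.strip()] = value.strip()
    return row


def cache_usage(row, cache_size_bytes):
    """Return (current, peak) XCom cache usage as fractions of the limit."""
    current = int(row["CURRENT_NUMBER_OF_BYTES_USED"])
    high = int(row["HIGH_NUMBER_OF_BYTES_USED"])
    return current / cache_size_bytes, high / cache_size_bytes


# Row copied from the query output above.
SAMPLE = """\
*************************** 1. row ***************************
                  EVENT_NAME: memory/group_rpl/GCS_XCom::xcom_cache
   SUM_NUMBER_OF_BYTES_ALLOC: 154713397
    SUM_NUMBER_OF_BYTES_FREE: 28441492
CURRENT_NUMBER_OF_BYTES_USED: 126271905
   HIGH_NUMBER_OF_BYTES_USED: 146137294
1 row in set (0.00 sec)
"""

# Assumption: 1 GiB limit; substitute the configured
# group_replication_message_cache_size of the server being checked.
ASSUMED_CACHE_SIZE = 1024 ** 3

row = parse_vertical(SAMPLE)
current, high = cache_usage(row, ASSUMED_CACHE_SIZE)
print(f"current usage: {current:.1%}, peak usage: {high:.1%}")
```

A peak ratio that approaches 100% suggests the cache is close to evicting messages, which is the condition the MY-011735 warning reports.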
[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Node 0 is unable to get message {4aec99ca 7562 0}, since the group is too far ahead. Node will now exit.'
[ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
[ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
[System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
[System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'
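Notes
If the cluster contains UNREACHABLE nodes, the following restrictions and drawbacks apply:
The cluster topology cannot be changed, which includes adding and removing nodes.
In single-primary mode, if the Primary node fails, a new primary cannot be elected.
If the Group Replication consistency level is AFTER or BEFORE_AND_AFTER, write operations keep waiting until the UNREACHABLE node is ONLINE again and has applied the operation.
Cluster throughput drops. In single-primary mode, setting group_replication_paxos_single_leader (introduced in MySQL 8.0.27) to ON addresses this problem.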