内容简介:在测试环境开启GTID运行一年多之后,我们准备近期上线生产。为了保证GTID顺利的上线,在测试环境模拟各种故障场景,观察GTID 的表现案例采用sandbox测试,GTID的一些关键测试参数如下其中Master的UUID是00021800-1111-1111-1111-111111111111,Slave的UUID是 00021802-3333-3333-3333-333333333333,我们简称 Master的UUID为1111,Slave的UUID为3333。
前言
在测试环境开启GTID运行一年多之后,我们准备近期上线生产。为了保证GTID顺利的上线,在测试环境模拟各种故障场景,观察GTID 的表现
案例采用sandbox测试,GTID的一些关键测试参数如下
master-info-repository=table relay-log-info-repository=table gtid_mode=ON log-slave-updates enforce-gtid-consistency relay_log_recovery=on
案例一
其中Master的UUID是00021800-1111-1111-1111-111111111111,Slave的UUID是 00021802-3333-3333-3333-333333333333,我们简称 Master的UUID为1111,Slave的UUID为3333。
时间线 | Master | Slave | 备注 |
---|---|---|---|
t1 | CREATE TABLE t1 | 生成的GTID为1111:1856 | |
t2 | STOP SLAVE SQL_THREAD | 停止从库的SQL THREAD 线程 | |
t3 | RESET MASTER; | 清空从库的gtid_executed和binlog信息 | |
t4 | SET GLOBAL GTID_PURGED='1111:1-1857' | 手动让从库认为1111:1857已经执行 | |
t5 | DROP TABLE t1 | 主库删除t1,GTID为1111:1857 | |
t6 | CREATE TABLE t1 | 主库创建t1,GTID为1111:1858 | |
t7 | START SLAVE SQL_THREAD | 启动从库 SQL 线程 | |
t8 | SHOW SLAVE STATUS\G | Error 'Table 't1' already exists' | |
t9 | STOP SLAVE SQL_THREAD; | ||
t10 | RESET MASTER | ||
t11 | SET GLOBAL GTID_PURGED='1111:1-1856' | 让从库重新去执行1111:1857的事务 | |
t12 | START SLAVE SQL_THREAD; | Error 'Table 't1' already exists' on query |
我们发现当我们不小心在从库让 SQL THREAD 跳过一个GTID事务的时候,想让从库重新执行这个跳过的GTID,却始终报错,我们把从库的线程的关键信息提供一下
Master_Log_File: mysql-bin.000009 Read_Master_Log_Pos: 2768 Relay_Log_File: mysql-relay.000006 Relay_Log_Pos: 594 Relay_Master_Log_File: mysql-bin.000009 Slave_IO_Running: Yes Slave_SQL_Running: No Exec_Master_Log_Pos: 2606 Last_SQL_Errno: 1050 Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)' Replicate_Ignore_Server_Ids: Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858 Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1856 Auto_Position: 1
刚刚我们执行的操作只是关闭了SQL THREAD,那么主库执行的DROP TABLE操作(被从库第一次手动跳过的)以及CREATE 操作都应该在relay log里面,我们解析一下,结果删去了一些不必要的输出。
$ mysqlbinlog mysql-relay.000006 /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #190308 10:56:36 server id 102 end_log_pos 123 CRC32 0x73fa3393 Start: binlog v 4, server v 5.7.25-28-log created 190308 10:56:36 # This Format_description_event appears in a relay log and was generated by the slave thread. # at 123 #190308 10:56:36 server id 102 end_log_pos 154 CRC32 0x4c964c64 Previous-GTIDs # [empty] # at 154 #700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4 # at 201 #190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10 # at 320 #190308 10:56:36 server id 0 end_log_pos 367 CRC32 0x72b7d6b4 Rotate to mysql-bin.000009 pos: 194 # at 367 #190308 10:56:36 server id 0 end_log_pos 414 CRC32 0x73628a00 Rotate to mysql-bin.000009 pos: 2426 # at 414 #190308 10:57:44 server id 1 end_log_pos 2491 CRC32 0x18c5896c GTID last_committed=13 sequence_number=14 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1857'/*!*/; # at 479 #190308 10:57:44 server id 1 end_log_pos 2606 CRC32 0x7c575de0 Query thread_id=4 exec_time=0 error_code=0 use `test`/*!*/; DROP TABLE `t1` /* generated by server */ 删除表的操作 /*!*/; # at 594 #190308 10:57:48 server id 1 end_log_pos 2671 CRC32 0x7138d3a8 GTID last_committed=14 sequence_number=15 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/; # at 659 #190308 10:57:48 server id 1 end_log_pos 2768 CRC32 0x3a8eeb36 Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1552013868/*!*/; create table t1(id int) 创建表的操作 /*!*/; SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/; DELIMITER ; # End of log file /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/; /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
我们可以看到 GTID 1111:1857 和 GTID 1111:1858 都在relay log里面,但是SQL thread 始终没有执行,我们看看为什么
Relay_Log_Pos = 594 表示的是SQL thread 执行的位点在relay log 的位置
我们看看594 在relay log 是什么event
# at 594 #190308 10:57:48 server id 1 end_log_pos 2671 CRC32 0x7138d3a8 GTID last_committed=14 sequence_number=15 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/; # at 659 #190308 10:57:48 server id 1 end_log_pos 2768 CRC32 0x3a8eeb36 Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1552013868/*!*/; create table t1(id int)
但是DROP TABLE 的位点是在 414 的位点,而SQL thread 位点却在594,所以始终没办法执行到 414 位点的DROP TABLE 操作,无法修复,那是为啥呢?
因为414 的DROP TABLE 操作的GTID 是 1111:1857 ,被我们人为的认为从库已经执行,当从库线程发现1111:1857已经执行过,则跳过该事务,继续往下执行回放。当执行到594位点,即GTID为1111:1858。这时候,当我们在t11时刻,设置gtid_purged=1111:1-1856',让从库可以重新执行GTID 为1111:1157 的事务,SQL thread 还是从relay log 的594位点开始执行,无法恢复。那怎么修复呢?
解决方案
1.让SQL thread 从414 位点开始执行
STOP SLAVE; CHANGE MASTER TO RELAY_LOG_FILE='mysql-relay.000006' ,RELAY_LOG_POS=414 ,MASTER_AUTO_POSITION=0; CHANGE MASTER TO MASTER_AUTO_POSITION=1; START SLAVE; SHOW SLAVE STATUS\G
我们再观察下slave的状态
Master_Log_File: mysql-bin.000009 Read_Master_Log_Pos: 2768 Relay_Log_File: mysql-relay.000002 Relay_Log_Pos: 756 Relay_Master_Log_File: mysql-bin.000009 Slave_IO_Running: Yes Slave_SQL_Running: Yes Exec_Master_Log_Pos: 2768 Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858 Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858
我们看到1111:1157 已经被加入Executed_Gtid_Set。
2.重新启动MySQL Slave
因为relay_log_recovery为on的状态,老的relay log 会被删除,重新根据gtid_executed 去主库拉取binlog,这时候就会顺利往下执行了。
案例二
时间线 | Master | Slave | 备注 |
---|---|---|---|
t1 | CREATE TABLE t1 | 生成的GTID为1111:1858 | |
t2 | STOP SLAVE | 停止从库的线程 | |
t3 | RESET MASTER; | 清空从库的gtid_executed和binlog信息 | |
t4 | SET GLOBAL GTID_PURGED='1111:1-1859' | 手动让从库认为1111:1859已经执行 | |
t5 | DROP TABLE t1 | 主库删除t1,GTID为1111:1859 | |
t6 | CREATE TABLE t1 | 主库创建t1,GTID为1111:1860 | |
t7 | START SLAVE | 启动从库线程 | |
t8 | SHOW SLAVE STATUS\G | Error 'Table 't1' already exists' | |
t9 | STOP SLAVE; | ||
t10 | RESET MASTER | ||
t11 | SET GLOBAL GTID_PURGED='1111:1-1858' | 让从库重新去执行1111:1859的事务 | |
t12 | START SLAVE; | Error 'Table 't1' already exists' on query |
这次和案例一不一样的地方是本次停止的是整个从库线程,包括IO线程和SQL线程。在启动前IO线程是没有拿到DROP的GTID为1111:1159的事务的,启动SLAVE线程后 MySQL 会怎么处理呢?
我们查一下官方文档,发现了以下这句话
In the initial handshake, the slave sends a GTID set containing the transactions that it has already received, committed, or both. This GTID set is equal to the union of the set of GTIDs in the gtid_executed system variable (@@GLOBAL.gtid_executed), and the set of GTIDs recorded in the Performance Schema replication_connection_status table as received transactions (the result of the statement SELECT RECEIVED_TRANSACTION_SET FROM PERFORMANCE_SCHEMA.replication_connection_status).
https://dev.mysql.com/doc/refman/5.7/en/replication-gtids-auto-positioning.html
也就是说,当从库连上主库的时候,从库会发送一个GTID集合,可能是已经执行的GTID集合,也可能是已经收到的GTID集合,或者是两者。这个GTID是已经执行的GTID集合和已经收到的GTID集合的并集。
当t7 时刻启动slave的时候,已经执行的GTID集合为1111:1-1859,已经收到的GTID集合为1111:1-1858,发送给主库的GTID集合为1111:1-1859。主库需要发送给从库的GTID为1111:1860,也就是CREATE TABLE语句,我们去relay log里面验证下
当前slave 的状态
Master_Log_File: mysql-bin.000009 Read_Master_Log_Pos: 3110 Relay_Log_File: mysql-relay.000005 Relay_Log_Pos: 414 Relay_Master_Log_File: mysql-bin.000009 Slave_IO_Running: Yes Slave_SQL_Running: No Exec_Master_Log_Pos: 2948 Last_SQL_Errno: 1050 Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)' Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860 Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858
我们看下 mysql-relay.000005的内容
$mysqlbinlog mysql-relay.000005 /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #190308 12:15:33 server id 102 end_log_pos 123 CRC32 0x1f2e3e22 Start: binlog v 4, server v 5.7.25-28-log created 190308 12:15:33 # This Format_description_event appears in a relay log and was generated by the slave thread. # at 123 #190308 12:15:33 server id 102 end_log_pos 154 CRC32 0x5422dcce Previous-GTIDs # [empty] # at 154 #700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4 # at 201 #190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10 # at 320 #190308 12:15:33 server id 0 end_log_pos 367 CRC32 0x182ec41f Rotate to mysql-bin.000009 pos: 194 # at 367 #190308 12:15:33 server id 0 end_log_pos 414 CRC32 0x7c85ab34 Rotate to mysql-bin.000009 pos: 2948 # at 414 #190308 12:15:27 server id 1 end_log_pos 3013 CRC32 0x4ab790eb GTID last_committed=16 sequence_number=17 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/; # at 479 #190308 12:15:27 server id 1 end_log_pos 3110 CRC32 0xe74d818a Query thread_id=4 exec_time=0 error_code=0 use `test`/*!*/; create table t1(id int) 建表的操作 /*!*/; # at 576 #190308 12:26:37 server id 102 end_log_pos 625 CRC32 0x532b5fe7 Rotate to mysql-relay.000006 pos: 4 SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/; DELIMITER ; # End of log file /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/; /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
我们发现 Relay_Log_Pos: 414 是CREATE TABLE的语句,GTID为1111:1860,在relay log mysql-relay.000005未发现GTID 为1111:1859 的DROP语句,我们发现后面还要一个relay log为mysql-relay.000006,我们看看里面的内容
$mysqlbinlog mysql-relay.000006 /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #190308 12:26:37 server id 102 end_log_pos 123 CRC32 0x4357741b Start: binlog v 4, server v 5.7.25-28-log created 190308 12:26:37 # This Format_description_event appears in a relay log and was generated by the slave thread. # at 123 #190308 12:26:37 server id 102 end_log_pos 194 CRC32 0xa087d1c5 Previous-GTIDs # 00021800-1111-1111-1111-111111111111:1860 # at 194 #700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4 # at 241 #190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10 # at 360 #190308 12:26:37 server id 0 end_log_pos 407 CRC32 0xaea58e04 Rotate to mysql-bin.000009 pos: 194 # at 407 #190308 12:26:37 server id 0 end_log_pos 454 CRC32 0xb19784ba Rotate to mysql-bin.000009 pos: 2768 # at 454 #190308 12:15:23 server id 1 end_log_pos 2833 CRC32 0x2279702d GTID last_committed=15 sequence_number=16 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/; # at 519 #190308 12:15:23 server id 1 end_log_pos 2948 CRC32 0xe93f8d9d Query thread_id=4 exec_time=0 error_code=0 use `test`/*!*/; DROP TABLE `t1` /* generated by server */ 删表的操作 /*!*/; # at 634 #190308 12:26:37 server id 0 end_log_pos 681 CRC32 0xd7fd7810 Rotate to mysql-bin.000009 pos: 3110 SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/; DELIMITER ; # End of log file /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/; /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
我们发现GTID为1111:1859的事务在1111:1860的后面,这也是符合预期的,因为在t7时刻启动从库线程的时候,主库需要发给从库的GTID只有1111:1860,因此mysql-relay.000005里面只有1111:1860,当我们在t11时刻执行SET GLOBAL GTID_PURGED='1111:1-1858'的时候,从库发送给主库的GTID集合为 1111:1-1858和 1111:1-1858:1860合集,也就是1111:1-1158:1860,这时候主库需要发送给从库的GTID 为1111:1859。
但是由于1111:1859 在 1111:1860 的后面,因此始终无法执行1111:1859的DROP事务,导致一直报错。
解决方案
重启MySQL Slave
在t12操作之后,直接重启从库,此时,mysql-relay.000005 和 mysql-relay.000006 都被删除,这时候由于gtid_executed 变量为1111:1-1858,那么从主库重新拉取的GTID为1111:1859-1860,我们从relay log mysql-relay.000007 和 mysql-relay.000008 里面验证一下
由于mysql-relay.000007 里面没有包含实际的事务信息,我们直接看mysql-relay.000008
$mysqlbinlog mysql-relay.000008 /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #190308 13:22:11 server id 102 end_log_pos 123 CRC32 0xd3d8ddd6 Start: binlog v 4, server v 5.7.25-28-log created 190308 13:22:11 # This Format_description_event appears in a relay log and was generated by the slave thread. # at 123 #190308 13:22:11 server id 102 end_log_pos 154 CRC32 0x5c6327ce Previous-GTIDs # [empty] # at 154 #700101 8:00:00 server id 1 end_log_pos 0 CRC32 0x6677862d Rotate to mysql-bin.000009 pos: 4 # at 201 #190308 9:30:10 server id 1 end_log_pos 123 CRC32 0x536ddf69 Start: binlog v 4, server v 5.7.25-28-log created 190308 9:30:10 BINLOG ' osWBXA8BAAAAdwAAAHsAAAAAAAQANS43LjI1LTI4LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA AWnfbVM= '/*!*/; # at 320 #190308 13:22:11 server id 0 end_log_pos 367 CRC32 0x08b48d1c Rotate to mysql-bin.000009 pos: 194 # at 367 #190308 13:22:11 server id 0 end_log_pos 414 CRC32 0xc7dd1334 Rotate to mysql-bin.000009 pos: 2768 # at 414 #190308 12:15:23 server id 1 end_log_pos 2833 CRC32 0x2279702d GTID last_committed=15 sequence_number=16 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/; # at 479 #190308 12:15:23 server id 1 end_log_pos 2948 CRC32 0xe93f8d9d Query thread_id=4 exec_time=0 error_code=0 use `test`/*!*/; DROP TABLE `t1` /* generated by server */ /*!*/; # at 594 #190308 12:15:27 server id 1 end_log_pos 3013 CRC32 0x4ab790eb GTID last_committed=16 sequence_number=17 rbr_only=no SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/; # at 659 #190308 12:15:27 server id 1 end_log_pos 3110 CRC32 0xe74d818a Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1552018527/*!*/; create table t1(id int) /*!*/; SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/; DELIMITER ; # End of log file /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/; /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
我们看到mysql-relay.000008 包含了1111:1859-1860,从库SQL线程也恢复正常状态
Master_Log_File: mysql-bin.000009 Read_Master_Log_Pos: 3110 Relay_Log_File: mysql-relay.000008 Relay_Log_Pos: 756 Relay_Master_Log_File: mysql-bin.000009 Slave_IO_Running: Yes Slave_SQL_Running: Yes Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860 Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1860
结语
GTID在上线生产的时候,需要多做测试,明白其底层的原理,并不是所有未执行的GTID事务一定会被执行,也需要注意GTID的事务的执行顺序。重启大法还是很有用的哈哈哈
以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网
猜你喜欢:- guns-lite 将所有表增加维护人员和维护时间信息
- 如何编写无法维护的代码
- Debian 包维护者辞职
- 从 VantCompoent 谈 小程序维护
- 从 VantComponent 谈 小程序维护
- 开源维护者,必有一战!
本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。