MySQL GTID 日常维护案例两则

栏目: 数据库 · 发布时间: 6年前

内容简介:在测试环境开启GTID运行一年多之后,我们准备近期上线生产。为了保证GTID顺利的上线,在测试环境模拟各种故障场景,观察GTID 的表现案例采用sandbox测试,GTID的一些关键测试参数如下其中Master的UUID是00021800-1111-1111-1111-111111111111,Slave的UUID是 00021802-3333-3333-3333-333333333333,我们简称 Master的UUID为1111,Slave的UUID为3333。

前言

在测试环境开启GTID运行一年多之后,我们准备近期上线生产。为了保证GTID顺利的上线,在测试环境模拟各种故障场景,观察GTID 的表现

案例采用sandbox测试,GTID的一些关键测试参数如下

master-info-repository=table
relay-log-info-repository=table
gtid_mode=ON
log-slave-updates
enforce-gtid-consistency
relay_log_recovery=on

案例一

其中Master的UUID是00021800-1111-1111-1111-111111111111,Slave的UUID是 00021802-3333-3333-3333-333333333333,我们简称 Master的UUID为1111,Slave的UUID为3333。

时间线 Master Slave 备注
t1 CREATE TABLE t1 生成的GTID为1111:1856
t2 STOP SLAVE SQL_THREAD 停止从库的SQL THREAD 线程
t3 RESET MASTER; 清空从库的gtid_executed和binlog信息
t4 SET GLOBAL GTID_PURGED='1111:1-1857' 手动让从库认为1111:1857已经执行
t5 DROP TABLE t1 主库删除t1,GTID为1111:1857
t6 CREATE TABLE t1 主库创建t1,GTID为1111:1858
t7 START SLAVE SQL_THREAD 启动从库 SQL 线程
t8 SHOW SLAVE STATUS\G Error 'Table 't1' already exists'
t9 STOP SLAVE SQL_THREAD;
t10 RESET MASTER
t11 SET GLOBAL GTID_PURGED='1111:1-1856' 让从库重新去执行1111:1857的事务
t12 START SLAVE SQL_THREAD; Error 'Table 't1' already exists' on query

我们发现当我们不小心在从库让 SQL THREAD 跳过一个GTID事务的时候,想让从库重新执行这个跳过的GTID,却始终报错,我们把从库的线程的关键信息提供一下

Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 2768
               Relay_Log_File: mysql-relay.000006
                Relay_Log_Pos: 594
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 2606       
               Last_SQL_Errno: 1050
               Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)'
  Replicate_Ignore_Server_Ids:
           Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858
            Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1856
                Auto_Position: 1

刚刚我们执行的操作只是关闭了SQL THREAD,那么主库执行的DROP TABLE操作(被从库第一次手动跳过的)以及CREATE 操作都应该在relay log里面,我们解析一下,结果删去了一些不必要的输出。

$ mysqlbinlog mysql-relay.000006
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 10:56:36 server id 102  end_log_pos 123 CRC32 0x73fa3393   Start: binlog v 4, server v 5.7.25-28-log created 190308 10:56:36
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 10:56:36 server id 102  end_log_pos 154 CRC32 0x4c964c64   Previous-GTIDs
# [empty]
# at 154
#700101  8:00:00 server id 1  end_log_pos 0 CRC32 0x6677862d   Rotate to mysql-bin.000009  pos: 4
# at 201
#190308  9:30:10 server id 1  end_log_pos 123 CRC32 0x536ddf69     Start: binlog v 4, server v 5.7.25-28-log created 190308  9:30:10

# at 320
#190308 10:56:36 server id 0  end_log_pos 367 CRC32 0x72b7d6b4     Rotate to mysql-bin.000009  pos: 194
# at 367
#190308 10:56:36 server id 0  end_log_pos 414 CRC32 0x73628a00     Rotate to mysql-bin.000009  pos: 2426
# at 414
#190308 10:57:44 server id 1  end_log_pos 2491 CRC32 0x18c5896c    GTID    last_committed=13   sequence_number=14  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1857'/*!*/;
# at 479
#190308 10:57:44 server id 1  end_log_pos 2606 CRC32 0x7c575de0    Query   thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
DROP TABLE `t1` /* generated by server */  删除表的操作
/*!*/;
# at 594
#190308 10:57:48 server id 1  end_log_pos 2671 CRC32 0x7138d3a8    GTID    last_committed=14   sequence_number=15  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/;
# at 659
#190308 10:57:48 server id 1  end_log_pos 2768 CRC32 0x3a8eeb36    Query   thread_id=4 exec_time=0 error_code=0
SET TIMESTAMP=1552013868/*!*/;
create table t1(id int)                创建表的操作
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

我们可以看到 GTID 1111:1857 和 GTID 1111:1858 都在relay log里面,但是SQL thread 始终没有执行,我们看看为什么

Relay_Log_Pos = 594 表示的是SQL thread 执行的位点在relay log 的位置

我们看看594 在relay log 是什么event

# at 594
#190308 10:57:48 server id 1  end_log_pos 2671 CRC32 0x7138d3a8    GTID    last_committed=14   sequence_number=15  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1858'/*!*/;
# at 659
#190308 10:57:48 server id 1  end_log_pos 2768 CRC32 0x3a8eeb36    Query   thread_id=4 exec_time=0 error_code=0
SET TIMESTAMP=1552013868/*!*/;
create table t1(id int)

但是DROP TABLE 的位点是在 414 的位点,而SQL thread 位点却在594,所以始终没办法执行到 414 位点的DROP TABLE 操作,无法修复,那是为啥呢?

因为414 的DROP TABLE 操作的GTID 是 1111:1857 ,被我们人为的认为从库已经执行,当从库线程发现1111:1857已经执行过,则跳过该事务,继续往下执行回放。当执行到594位点,即GTID为1111:1858。这时候,当我们在t11时刻,设置gtid_purged=1111:1-1856',让从库可以重新执行GTID 为1111:1157 的事务,SQL thread 还是从relay log 的594位点开始执行,无法恢复。那怎么修复呢?

解决方案

1.让SQL thread 从414 位点开始执行

STOP SLAVE;
CHANGE MASTER TO RELAY_LOG_FILE='mysql-relay.000006' ,RELAY_LOG_POS=414 ,MASTER_AUTO_POSITION=0;
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW SLAVE STATUS\G

我们再观察下slave的状态

Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 2768
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 756
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

          Exec_Master_Log_Pos: 2768

           Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1857-1858
            Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858

我们看到1111:1157 已经被加入Executed_Gtid_Set。

2.重新启动MySQL Slave

因为relay_log_recovery为on的状态,老的relay log 会被删除,重新根据gtid_executed 去主库拉取binlog,这时候就会顺利往下执行了。

案例二

时间线 Master Slave 备注
t1 CREATE TABLE t1 生成的GTID为1111:1858
t2 STOP SLAVE 停止从库的线程
t3 RESET MASTER; 清空从库的gtid_executed和binlog信息
t4 SET GLOBAL GTID_PURGED='1111:1-1859' 手动让从库认为1111:1859已经执行
t5 DROP TABLE t1 主库删除t1,GTID为1111:1859
t6 CREATE TABLE t1 主库创建t1,GTID为1111:1860
t7 START SLAVE 启动从库线程
t8 SHOW SLAVE STATUS\G Error 'Table 't1' already exists'
t9 STOP SLAVE;
t10 RESET MASTER
t11 SET GLOBAL GTID_PURGED='1111:1-1858' 让从库重新去执行1111:1859的事务
t12 START SLAVE; Error 'Table 't1' already exists' on query

这次和案例一不一样的地方是本次停止的是整个从库线程,包括IO线程和SQL线程。在启动前IO线程是没有拿到DROP的GTID为1111:1159的事务的,启动SLAVE线程后 MySQL 会怎么处理呢?

我们查一下官方文档,发现了以下这句话

 In the initial handshake, the slave sends a GTID set containing the transactions that it has already received, committed, or both. This GTID set is equal to the union of the set of GTIDs in the gtid_executed system variable (@@GLOBAL.gtid_executed), and the set of GTIDs recorded in the Performance Schema replication_connection_status table as received transactions (the result of the statement SELECT RECEIVED_TRANSACTION_SET FROM PERFORMANCE_SCHEMA.replication_connection_status). 

https://dev.mysql.com/doc/refman/5.7/en/replication-gtids-auto-positioning.html

也就是说,当从库连上主库的时候,从库会发送一个GTID集合,可能是已经执行的GTID集合,也可能是已经收到的GTID集合,或者是两者。这个GTID是已经执行的GTID集合和已经收到的GTID集合的并集。

当t7 时刻启动slave的时候,已经执行的GTID集合为1111:1-1859,已经收到的GTID集合为1111:1-1858,发送给主库的GTID集合为1111:1-1859。主库需要发送给从库的GTID为1111:1860,也就是CREATE TABLE语句,我们去relay log里面验证下

当前slave 的状态

Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 3110
               Relay_Log_File: mysql-relay.000005
                Relay_Log_Pos: 414
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 2948
               Last_SQL_Errno: 1050
               Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1(id int)'
           Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860
            Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1858

我们看下 mysql-relay.000005的内容

$mysqlbinlog mysql-relay.000005
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 12:15:33 server id 102  end_log_pos 123 CRC32 0x1f2e3e22   Start: binlog v 4, server v 5.7.25-28-log created 190308 12:15:33
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 12:15:33 server id 102  end_log_pos 154 CRC32 0x5422dcce   Previous-GTIDs
# [empty]
# at 154
#700101  8:00:00 server id 1  end_log_pos 0 CRC32 0x6677862d   Rotate to mysql-bin.000009  pos: 4
# at 201
#190308  9:30:10 server id 1  end_log_pos 123 CRC32 0x536ddf69     Start: binlog v 4, server v 5.7.25-28-log created 190308  9:30:10
# at 320
#190308 12:15:33 server id 0  end_log_pos 367 CRC32 0x182ec41f     Rotate to mysql-bin.000009  pos: 194
# at 367
#190308 12:15:33 server id 0  end_log_pos 414 CRC32 0x7c85ab34     Rotate to mysql-bin.000009  pos: 2948
# at 414
#190308 12:15:27 server id 1  end_log_pos 3013 CRC32 0x4ab790eb    GTID    last_committed=16   sequence_number=17  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/;
# at 479
#190308 12:15:27 server id 1  end_log_pos 3110 CRC32 0xe74d818a    Query   thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
create table t1(id int)        建表的操作
/*!*/;
# at 576
#190308 12:26:37 server id 102  end_log_pos 625 CRC32 0x532b5fe7   Rotate to mysql-relay.000006  pos: 4
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

我们发现 Relay_Log_Pos: 414 是CREATE TABLE的语句,GTID为1111:1860,在relay log mysql-relay.000005未发现GTID 为1111:1859 的DROP语句,我们发现后面还要一个relay log为mysql-relay.000006,我们看看里面的内容

$mysqlbinlog mysql-relay.000006
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 12:26:37 server id 102  end_log_pos 123 CRC32 0x4357741b   Start: binlog v 4, server v 5.7.25-28-log created 190308 12:26:37
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 12:26:37 server id 102  end_log_pos 194 CRC32 0xa087d1c5   Previous-GTIDs
# 00021800-1111-1111-1111-111111111111:1860
# at 194
#700101  8:00:00 server id 1  end_log_pos 0 CRC32 0x6677862d   Rotate to mysql-bin.000009  pos: 4
# at 241
#190308  9:30:10 server id 1  end_log_pos 123 CRC32 0x536ddf69     Start: binlog v 4, server v 5.7.25-28-log created 190308  9:30:10

# at 360
#190308 12:26:37 server id 0  end_log_pos 407 CRC32 0xaea58e04     Rotate to mysql-bin.000009  pos: 194
# at 407
#190308 12:26:37 server id 0  end_log_pos 454 CRC32 0xb19784ba     Rotate to mysql-bin.000009  pos: 2768
# at 454
#190308 12:15:23 server id 1  end_log_pos 2833 CRC32 0x2279702d    GTID    last_committed=15   sequence_number=16  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/;
# at 519
#190308 12:15:23 server id 1  end_log_pos 2948 CRC32 0xe93f8d9d    Query   thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;

DROP TABLE `t1` /* generated by server */   删表的操作
/*!*/;
# at 634
#190308 12:26:37 server id 0  end_log_pos 681 CRC32 0xd7fd7810     Rotate to mysql-bin.000009  pos: 3110
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

我们发现GTID为1111:1859的事务在1111:1860的后面,这也是符合预期的,因为在t7时刻启动从库线程的时候,主库需要发给从库的GTID只有1111:1860,因此mysql-relay.000005里面只有1111:1860,当我们在t11时刻执行SET GLOBAL GTID_PURGED='1111:1-1858'的时候,从库发送给主库的GTID集合为 1111:1-1858和 1111:1-1858:1860合集,也就是1111:1-1158:1860,这时候主库需要发送给从库的GTID 为1111:1859。

但是由于1111:1859 在 1111:1860 的后面,因此始终无法执行1111:1859的DROP事务,导致一直报错。

解决方案

重启MySQL Slave

在t12操作之后,直接重启从库,此时,mysql-relay.000005 和 mysql-relay.000006 都被删除,这时候由于gtid_executed 变量为1111:1-1858,那么从主库重新拉取的GTID为1111:1859-1860,我们从relay log mysql-relay.000007 和 mysql-relay.000008 里面验证一下

由于mysql-relay.000007 里面没有包含实际的事务信息,我们直接看mysql-relay.000008

$mysqlbinlog mysql-relay.000008
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190308 13:22:11 server id 102  end_log_pos 123 CRC32 0xd3d8ddd6   Start: binlog v 4, server v 5.7.25-28-log created 190308 13:22:11
# This Format_description_event appears in a relay log and was generated by the slave thread.
# at 123
#190308 13:22:11 server id 102  end_log_pos 154 CRC32 0x5c6327ce   Previous-GTIDs
# [empty]
# at 154
#700101  8:00:00 server id 1  end_log_pos 0 CRC32 0x6677862d   Rotate to mysql-bin.000009  pos: 4
# at 201
#190308  9:30:10 server id 1  end_log_pos 123 CRC32 0x536ddf69     Start: binlog v 4, server v 5.7.25-28-log created 190308  9:30:10
BINLOG '
osWBXA8BAAAAdwAAAHsAAAAAAAQANS43LjI1LTI4LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA
AWnfbVM=
'/*!*/;
# at 320
#190308 13:22:11 server id 0  end_log_pos 367 CRC32 0x08b48d1c     Rotate to mysql-bin.000009  pos: 194
# at 367
#190308 13:22:11 server id 0  end_log_pos 414 CRC32 0xc7dd1334     Rotate to mysql-bin.000009  pos: 2768
# at 414
#190308 12:15:23 server id 1  end_log_pos 2833 CRC32 0x2279702d    GTID    last_committed=15   sequence_number=16  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1859'/*!*/;
# at 479
#190308 12:15:23 server id 1  end_log_pos 2948 CRC32 0xe93f8d9d    Query   thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
DROP TABLE `t1` /* generated by server */
/*!*/;
# at 594
#190308 12:15:27 server id 1  end_log_pos 3013 CRC32 0x4ab790eb    GTID    last_committed=16   sequence_number=17  rbr_only=no
SET @@SESSION.GTID_NEXT= '00021800-1111-1111-1111-111111111111:1860'/*!*/;
# at 659
#190308 12:15:27 server id 1  end_log_pos 3110 CRC32 0xe74d818a    Query   thread_id=4 exec_time=0 error_code=0
SET TIMESTAMP=1552018527/*!*/;
create table t1(id int)
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

我们看到mysql-relay.000008 包含了1111:1859-1860,从库SQL线程也恢复正常状态

Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 3110
               Relay_Log_File: mysql-relay.000008
                Relay_Log_Pos: 756
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
           Retrieved_Gtid_Set: 00021800-1111-1111-1111-111111111111:1859-1860
            Executed_Gtid_Set: 00021800-1111-1111-1111-111111111111:1-1860

结语

GTID在上线生产的时候,需要多做测试,明白其底层的原理,并不是所有未执行的GTID事务一定会被执行,也需要注意GTID的事务的执行顺序。重启大法还是很有用的哈哈哈


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Web Applications (Hacking Exposed)

Web Applications (Hacking Exposed)

Joel Scambray、Mike Shema / McGraw-Hill Osborne Media / 2002-06-19 / USD 49.99

Get in-depth coverage of Web application platforms and their vulnerabilities, presented the same popular format as the international bestseller, Hacking Exposed. Covering hacking scenarios across diff......一起来看看 《Web Applications (Hacking Exposed)》 这本书的介绍吧!

随机密码生成器
随机密码生成器

多种字符组合密码

html转js在线工具
html转js在线工具

html转js在线工具

RGB CMYK 转换工具
RGB CMYK 转换工具

RGB CMYK 互转工具