mysql主从宕机恢复步骤
在生产环境中经常会出现slave出现错误,从而发生主从同步故障,此时就需要人工干预了。以下是小生整理出的一个回复思路,欢迎大佬指导,分享更好的方法。
宕机恢复分为几种情况:
1.从库数据一致性要求低
2.从库数据一致性要求高
从库数据要求一致性低:
这种情况比较好解决,由于对数据一致性要求比较低,我们可以先把slave起来从而达到热备份的效果(因为之前有做过全量备份和增量备份,所以不用担心数据丢失)。
以目前master对应的pos作为slave起始的pos。
MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.1.11
Master_User: repluser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: matser11.000002
Read_Master_Log_Pos: 1486
Relay_Log_File: mariadb-relay-bin.000002
Relay_Log_Pos: 630
Relay_Master_Log_File: matser11.000002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1146
Last_Error: Error 'Table 'db1.name' doesn't exist' on query. Default database: 'db1'. Query: 'insert into name values(1,"haha")'
Skip_Counter: 0
Exec_Master_Log_Pos: 919
Relay_Log_Space: 1493
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1146
Last_SQL_Error: Error 'Table 'db1.name' doesn't exist' on query. Default database: 'db1'. Query: 'insert into name values(1,"haha")'
Replicate_Ignore_Server_Ids:
Master_Server_Id: 11
1 row in set (0.00 sec)
ERROR: No query specified
1.将业务主库上锁,阻止对数据的更新
MariaDB [db1]> SET AUTOCOMMIT=0;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> lock tables name read;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> show master status;
+-----------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+----------+--------------+------------------+
| matser11.000002 | 1486 | | |
+-----------------+----------+--------------+------------------+
1 row in set (0.00 sec)
2.回到从库重新做主从
MariaDB [db1]> stop slave ;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> reset slave all;
Query OK, 0 rows affected (0.01 sec)
MariaDB [db1]> show slave status\G;
Empty set (0.00 sec)
ERROR: No query specified
MariaDB [db1]> change master to master_host="192.168.1.11",master_user="repluser",master_password="123qqq...A",master_log_file="matser11.000002",master_log_pos=1486;
Query OK, 0 rows affected (0.01 sec)
MariaDB [db1]> start slave ;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.1.11
Master_User: repluser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: matser11.000002
Read_Master_Log_Pos: 1486
Relay_Log_File: mariadb-relay-bin.000002
Relay_Log_Pos: 528
Relay_Master_Log_File: matser11.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1486
Relay_Log_Space: 824
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 11
1 row in set (0.00 sec)
1)使用LOCK TALBES虽然可以给InnoDB加表级锁,但必须说明的是,表锁不是由InnoDB存储引擎层管理的,而是由其上一层MySQL Server负责的,仅当autocommit=0、innodb_table_lock=1(默认设置)时,InnoDB层才能知道MySQL加的表锁,MySQL Server才能感知InnoDB加的行锁,这种情况下,InnoDB才能自动识别涉及表级锁的死锁;否则,InnoDB将无法自动检测并处理这种死锁。
2)在用LOCAK TABLES对InnoDB锁时要注意,要将AUTOCOMMIT设为0,否则MySQL不会给表加锁;事务结束前,不要用UNLOCAK TABLES释放表锁,因为UNLOCK TABLES会隐含地提交事务;COMMIT或ROLLBACK产不能释放用LOCAK TABLES加的表级锁,必须用UNLOCK TABLES释放表锁,正确的方式见如下语句。
MariaDB [db1]> COMMIT;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> unlock tables;
Query OK, 0 rows affected (0.00 sec)
这种方法小生并不推荐
skip掉相关错误
停止slave服务
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
开启slave服务并查看状态
MariaDB [db1]> stop slave;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> start slave;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.1.11
Master_User: repluser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: matser11.000001
Read_Master_Log_Pos: 1195
Relay_Log_File: mariadb-relay-bin.000003
Relay_Log_Pos: 528
Relay_Master_Log_File: matser11.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1195
Relay_Log_Space: 2057
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 11
1 row in set (0.00 sec)
这里跳过的是一个事务。当然,也可以跳过多个事务,但要谨慎,毕竟,你并不知道跳过的是什么事务。
建议:可反复执行上述步骤,仔细查看导致从库不能同步的语句。有的时候,阻止从库的事务太多,这种方法就显得略为低效。
可分析主库日志的事务,来确定SQL_SLAVE_SKIP_COUNTER的合适值
数据库一致性要求高
这种情况比较难受,因为生产环境还在继续,你必须以最快的方式恢复到最新的数据。
旧数据可用:
查看宕机时的偏移量或者最后一条sql语句(可根据时间查看),并找出原因,加以解决
MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.1.11
Master_User: repluser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: matser11.000001
Read_Master_Log_Pos: 1385
Relay_Log_File: mariadb-relay-bin.000003
Relay_Log_Pos: 528
Relay_Master_Log_File: matser11.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1146
Last_Error: Error 'Table 'db1.name' doesn't exist' on query. Default database: 'db1'. Query: 'insert into name values(13,"lala")'
Skip_Counter: 0
Exec_Master_Log_Pos: 1195
Relay_Log_Space: 2247
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1146
Last_SQL_Error: Error 'Table 'db1.name' doesn't exist' on query. Default database: 'db1'. Query: 'insert into name values(13,"lala")'
Replicate_Ignore_Server_Ids:
Master_Server_Id: 11
1 row in set (0.00 sec)
ERROR: No query specified
[root@slave mysql]# mysqlbinlog mariadb-relay-bin.000003
;
# at 595
#200821 19:28:18 server id 11 end_log_pos 1358 Query thread_id=14 exec_time=0 error_code=0
use `db1`;
SET TIMESTAMP=1598009298;
insert into name values(13,"lala")
;
# at 691
#200821 19:28:18 server id 11 end_log_pos 1385 Xid = 512
COMMIT;
DELIMITER ;
# End of log file
ROLLBACK ;
;
;
将业务主库上锁,阻止对数据的更新
MariaDB [db1]> SET AUTOCOMMIT=0;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> lock tables name read;
Query OK, 0 rows affected (0.00 sec)
将缺失数据导出
[root@master mysql]# date
2020年 08月 21日 星期五 19:57:42 CST
[root@master mysql]# mysqlbinlog --start-datetime="20-08-21 19:28:18" --stop-datetime="20--8-21 19:57:42" matser11.000001 > data.sql
[root@master mysql]# scp data.sql root@192.168.1.12:/root
恢复数据并查看
[root@slave mysql]# mysql -uroot -p < /root/data.sql
[root@slave mysql]# mysql
....
恢复主从(这里最好重做主从,由于我的是缺失库导致sql线程失败,所以重启slave就ok了)
MariaDB [db1]> stop slave;
MariaDB [db1]> start slave;
MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.1.11
Master_User: repluser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: matser11.000001
Read_Master_Log_Pos: 5185
Relay_Log_File: mariadb-relay-bin.000004
Relay_Log_Pos: 1098
Relay_Master_Log_File: matser11.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 5185
Relay_Log_Space: 5097
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 11
1 row in set (0.00 sec)
ERROR: No query specified
解开主库的锁
MariaDB [db1]> COMMIT;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> unlock tables;
Query OK, 0 rows affected (0.00 sec)
旧数据不可用:
清空原来的主从
同上
使用最新一次的全量备份和增量备份先恢复大量数据。
[root@slave mysql]# mysql -uroot -p < /mysql/all/data.sql
[root@slave mysql]# mysql -uroot -p < /mysql/name/data.sql
最后再合适的时机将业务主库上锁,阻止对数据的更新。
同上
利用binlog日志导出剩余的数据进行恢复。(此时的起始值应该是上次增量备份结束的时间)
同上
恢复主从
同上
解开主库的锁
优化方案
什么?运维小哥没做备份?(不想混了?)嫌麻烦?
那强烈推荐PXC集群。