1、出現了大量很小的relay log如下,堆積量大約2600個: ... -rw-r----- 1 mysql dba 12827 Oct 11 12:28 mysql-relay-bin.036615 -rw-r----- 1 mysql dba 4908 Oct 11 12:28 mysql-relay-bin.036616 -rw-r----- 1 mysql dba 1188 Oct 11 12:28 mysql-relay-bin.036617 -rw-r----- 1 mysql dba 5823 Oct 11 12:29 mysql-relay-bin.036618 -rw-r----- 1 mysql dba 507 Oct 11 12:29 mysql-relay-bin.036619 -rw-r----- 1 mysql dba 1188 Oct 11 12:29 mysql-relay-bin.036620 -rw-r----- 1 mysql dba 3203 Oct 11 12:29 mysql-relay-bin.036621 -rw-r----- 1 mysql dba 37916 Oct 11 12:30 mysql-relay-bin.036622 -rw-r----- 1 mysql dba 507 Oct 11 12:30 mysql-relay-bin.036623 -rw-r----- 1 mysql dba 1188 Oct 11 12:31 mysql-relay-bin.036624 -rw-r----- 1 mysql dba 4909 Oct 11 12:31 mysql-relay-bin.036625 -rw-r----- 1 mysql dba 1188 Oct 11 12:31 mysql-relay-bin.036626 -rw-r----- 1 mysql dba 507 Oct 11 12:31 mysql-relay-bin.036627 -rw-r----- 1 mysql dba 507 Oct 11 12:32 mysql-relay-bin.036628 -rw-r----- 1 mysql dba 1188 Oct 11 12:32 mysql-relay-bin.036629 -rw-r----- 1 mysql dba 454 Oct 11 12:32 mysql-relay-bin.036630 -rw-r----- 1 mysql dba 6223 Oct 11 12:32 mysql-relay-bin.index 2、主庫錯誤日志有如下錯誤 2019-10-11T12:31:26.517309+08:00 61303425 [Note] While initializing dump thread for slave with UUID <eade0d03-ad91-11e7-8559-c81f66be1379>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303421). 2019-10-11T12:31:26.517489+08:00 61303425 [Note] Start binlog_dump to master_thread_id(61303425) slave_server(19304313), pos(, 4) 2019-10-11T12:31:44.203747+08:00 61303449 [Note] While initializing dump thread for slave with UUID <eade0d03-ad91-11e7-8559-c81f66be1379>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303425). 2019-10-11T12:31:44.203896+08:00 61303449 [Note] Start binlog_dump to master_thread_id(61303449) slave_server(19304313), pos(, 4) 二、slave_net_timeout參數分析 實際上第一眼看這個案例我也覺得很奇怪,因為很少有人會去設置slave_net_timeout參數,同樣我們也沒有設置過,因此關注較少。但是 @peaceful自己找到了可能出現問題的設置就是當前從庫slave_net_timeout參數設置為10。我就順著這個線索往下分析,我們先來看看slave_net_timeout參數的功能。
If you are logging master connection information to tables, MASTER_HEARTBEAT_PERIOD can be seen as the value of the Heartbeat column of the mysql.slave_master_info table. Setting interval to 0 disables heartbeats altogether. The default value for interval is equal to the value of slave_net_timeout divided by 2. Setting @@global.slave_net_timeout to a value less than that of the current heartbeat interval results in a warning being issued. The effect of issuing RESET SLAVE on the heartbeat interval is to reset it to the default value. 四、案例模擬 有了理論基礎就很好了模擬了,但是延遲這一點不太好模擬,因此我模擬的時候關閉了從庫的SQL線程來模擬積壓的情況。
提前配置好主從,查看當前的心跳周期和slave_net_timeout參數如下:
mysql> show variables like '%slave_net_timeout%'; +-------------------+-------+ | Variable_name | Value | +-------------------+-------+ | slave_net_timeout | 60 | +-------------------+-------+ 1 row in set (0.01 sec) mysql> select Heartbeat from slave_master_info G *************************** 1. row *************************** Heartbeat: 30 1 row in set (0.00 sec) 1、停止從庫的SQL線程 stop slave sql_thread;
2、設置slave_net_timeout為10
mysql> set global slave_net_timeout=10; Query OK, 0 rows affected, 1 warning (0.00 sec) mysql> show warnings; +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Level | Code | Message | +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Warning | 1704 | The requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds. A sensible value for the period should be less than the timeout. | +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) 可以看到這里實際上已經有一個警告了。
-rw-r----- 1 mysql mysql 500 2019-09-27 23:48:32.655001361 +0800 relay.000142 -rw-r----- 1 mysql mysql 500 2019-09-27 23:48:42.943001355 +0800 relay.000143 -rw-r----- 1 mysql mysql 500 2019-09-27 23:48:53.293001363 +0800 relay.000144 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:03.502000598 +0800 relay.000145 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:13.799001357 +0800 relay.000146 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:24.055001354 +0800 relay.000147 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:34.280001827 +0800 relay.000148 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:44.496001365 +0800 relay.000149 -rw-r----- 1 mysql mysql 500 2019-09-27 23:49:54.789001353 +0800 relay.000150 -rw-r----- 1 mysql mysql 500 2019-09-27 23:50:05.485001371 +0800 relay.000151 -rw-r----- 1 mysql mysql 500 2019-09-27 23:50:15.910001430 +0800 relay.000152 大概每10秒主庫的日志會輸出如下日志:
2019-10-08T02:27:24.996827+08:00 217 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(216). 2019-10-08T02:27:24.998297+08:00 217 [Note] Start binlog_dump to master_thread_id(217) slave_server(953340), pos(, 4) 2019-10-08T02:27:35.265961+08:00 218 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(217). 2019-10-08T02:27:35.266653+08:00 218 [Note] Start binlog_dump to master_thread_id(218) slave_server(953340), pos(, 4) 2019-10-08T02:27:45.588074+08:00 219 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(218). 2019-10-08T02:27:45.589814+08:00 219 [Note] Start binlog_dump to master_thread_id(219) slave_server(953340), pos(, 4) 2019-10-08T02:27:55.848558+08:00 220 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(219). 2019-10-08T02:27:55.849442+08:00 220 [Note] Start binlog_dump to master_thread_id(220) slave_server(953340), pos(, 4) 這個日志就和案例中的一模一樣了。
->Binlog_sender::wait_new_events ->Binlog_sender::wait_with_heartbeat set_timespec_nsec(&ts, m_heartbeat_period); //心跳超時 ret= mysql_bin_log.wait_for_update_bin_log(m_thd, &ts);//等待 if (ret != ETIMEDOUT && ret != ETIME) //如果是正常收到則收到信號,說明有新的Event到來,否則如果是超時則發送心跳Event break; //正常返回0 是超時返回ETIMEDOUT 繼續循環 if (send_heartbeat_event(log_pos)) //發送心跳Event return 1; 5、重連會殺掉可能的存在的DUMP線程 根據UUID進行比對如下:
->kill_zombie_dump_threads Find_zombie_dump_thread find_zombie_dump_thread(slave_uuid); THD *tmp= Global_THD_manager::get_instance()-> find_thd(&find_zombie_dump_thread); if (tmp) { /* Here we do not call kill_one_thread() as it will be slow because it will iterate through the list again. We just to do kill the thread ourselves. */ if (log_warnings > 1) { if (slave_uuid.length()) { sql_print_information("While initializing dump thread for slave with " "UUID <%s>, found a zombie dump thread with the " "same UUID. Master is killing the zombie dump " "thread(%u).", slave_uuid.c_ptr(), tmp->thread_id()); }//這里就是本案例中的日志了 ..... 這里我們看到了案例中的日志。