etcd backup and restore for a kubeadm-installed Kubernetes cluster
[TOC]
1. Background
After the typhoon on September 16, 2018, etcd failed to start on one of my Kubernetes test clusters. Half a day of rescue attempts got nowhere (all three masters showed the error below), so I had no choice but to spend another half day rebuilding the environment. Even with an etcd cluster, backups are essential: once the data is gone, everything is gone. Fortunately this happened early; if it had happened in production, I would probably have been packing my bags. Hence this write-up on backing up Kubernetes.
```
2018-09-17 00:11:55.781279 I | etcdmain: etcd Version: 3.2.18
2018-09-17 00:11:55.781457 I | etcdmain: Git SHA: eddf599c6
2018-09-17 00:11:55.781477 I | etcdmain: Go Version: go1.8.7
2018-09-17 00:11:55.781503 I | etcdmain: Go OS/Arch: linux/amd64
2018-09-17 00:11:55.781519 I | etcdmain: setting maximum number of CPUs to 32, total number of available CPUs is 32
2018-09-17 00:11:55.781634 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-09-17 00:11:55.781702 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = , trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true
2018-09-17 00:11:55.783073 I | embed: listening for peers on https://192.168.105.92:2380
2018-09-17 00:11:55.783182 I | embed: listening for client requests on 127.0.0.1:2379
2018-09-17 00:11:55.783281 I | embed: listening for client requests on 192.168.105.92:2379
2018-09-17 00:11:55.791474 I | etcdserver: recovered store from snapshot at index 16471696
2018-09-17 00:11:55.792633 I | mvcc: restore compact to 13683366
2018-09-17 00:11:55.849153 C | mvcc: store.keyindex: put with unexpected smaller revision [{13685569 0} / {13685569 0}]
panic: store.keyindex: put with unexpected smaller revision [{13685569 0} / {13685569 0}]

goroutine 89 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42018c160, 0xfa564e, 0x3e, 0xc420062cb0, 0x2, 0x2)
	/tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*keyIndex).put(0xc4207fd7c0, 0xd0d341, 0x0)
	/tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/key_index.go:80 +0x3ec
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.restoreIntoIndex.func1(0xc42029e460, 0xc4202a0600, 0x14bef40, 0xc420285640)
	/tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:367 +0x3e3
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.restoreIntoIndex
	/tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:374 +0xa5
```
2. Environment
Kubernetes 1.11, installed with kubeadm.
3. Inspecting the etcd cluster
```
# List the members
etcdctl --endpoints=https://192.168.105.92:2379,https://192.168.105.93:2379,https://192.168.105.94:2379 \
    --cert-file=/etc/kubernetes/pki/etcd/server.crt \
    --key-file=/etc/kubernetes/pki/etcd/server.key \
    --ca-file=/etc/kubernetes/pki/etcd/ca.crt \
    member list

# List the Kubernetes keys
export ETCDCTL_API=3
etcdctl get / --prefix --keys-only \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt
```
4. Backing up etcd data
Besides the etcd snapshot, the following directories are backed up:

```
/etc/kubernetes/
/var/lib/kubelet/
```
Add the script to a scheduled (cron) job so it backs up daily.
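Scheduling it could look like the following crontab fragment (the script path `/opt/scripts/ut_backup_k8s.sh` and the 02:00 run time are assumptions for illustration, not from the original post):

```shell
# /etc/cron.d/k8s-backup — run the backup script daily at 02:00
# (path and time are hypothetical; adjust to where the script actually lives)
0 2 * * * root /bin/bash /opt/scripts/ut_backup_k8s.sh
```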
```
#!/usr/bin/env bash
##############################################################
# File Name: ut_backup_k8s.sh
# Version: V1.0
# Author: Chinge_Yang
# Blog: http://blog.csdn.net/ygqygq2
# Created Time : 2018-09-18 09:13:55
# Description:
##############################################################
# Directory where the script lives
cd `dirname $0`
bash_path=`pwd`
# Script name
me=$(basename $0)

# Directories to prune and their retention in days ("dir:days")
delete_dirs=("/data/backup/kubernetes:7")
backup_dir=/data/backup/kubernetes
files_dir=("/etc/kubernetes" "/var/lib/kubelet")
log_dir=$backup_dir/log
shell_log=$log_dir/${USER}_${me}.log
ssh_port="22"
ssh_parameters="-o StrictHostKeyChecking=no -o ConnectTimeout=60"
ssh_command="ssh ${ssh_parameters} -p ${ssh_port}"
scp_command="scp ${ssh_parameters} -P ${ssh_port}"
DATE=$(date +%F)
BACK_SERVER="127.0.0.1"                                    # remote backup server IP
BACK_SERVER_BASE_DIR="/data/backup"
BACK_SERVER_DIR="$BACK_SERVER_BASE_DIR/kubernetes/${HOSTNAME}"  # remote backup directory
BACK_SERVER_LOG_DIR="$BACK_SERVER_BASE_DIR/kubernetes/logs"

# Logging helper
function save_log () {
    echo -e "`date +%F\ %T` $*" >> $shell_log
}

[ ! -d $log_dir ] && mkdir -p $log_dir
save_log "start backup kubernetes"

# Colored output helpers
function red_echo () {
    # Usage: red_echo "text"
    local what=$*
    echo -e "\e[1;31m ${what} \e[0m"
}

function green_echo () {
    # Usage: green_echo "text"
    local what=$*
    echo -e "\e[1;32m ${what} \e[0m"
}

function yellow_echo () {
    # Usage: yellow_echo "text"
    local what=$*
    echo -e "\e[1;33m ${what} \e[0m"
}

function twinkle_echo () {
    # Usage: twinkle_echo $(red_echo "text") — blinking red output
    local twinkle='\e[05m'
    local what="${twinkle} $*"
    echo -e "${what}"
}

function return_echo () {
    [ $? -eq 0 ] && green_echo "$* succeeded" || red_echo "$* failed"
}

function return_error_exit () {
    [ $? -eq 0 ] && REVAL="0"
    local what=$*
    if [ "$REVAL" = "0" ];then
        [ ! -z "$what" ] && green_echo "$what succeeded"
    else
        red_echo "$* failed, exiting"
        exit 1
    fi
}

# Interactive confirmation (exit on "no")
function user_verify_function () {
    while true;do
        echo ""
        read -p "Confirm? [Y/N]:" Y
        case $Y in
            [yY]|[yY][eE][sS])
                echo -e "answer: \\033[20G [ \e[1;32myes\e[0m ] \033[0m"
                break
                ;;
            [nN]|[nN][oO])
                echo -e "answer: \\033[20G [ \e[1;32mno\e[0m ] \033[0m"
                exit 1
                ;;
            *)
                continue
                ;;
        esac
    done
}

# Interactive confirmation (skip on "no")
function user_pass_function () {
    while true;do
        echo ""
        read -p "Confirm? [Y/N]:" Y
        case $Y in
            [yY]|[yY][eE][sS])
                echo -e "answer: \\033[20G [ \e[1;32myes\e[0m ] \033[0m"
                break
                ;;
            [nN]|[nN][oO])
                echo -e "answer: \\033[20G [ \e[1;32mno\e[0m ] \033[0m"
                return 1
                ;;
            *)
                continue
                ;;
        esac
    done
}

function backup () {
    # Archive the configuration directories
    for f_d in ${files_dir[@]}; do
        f_name=$(basename ${f_d})
        d_name=$(dirname $f_d)
        cd $d_name
        tar -cjf ${f_name}.tar.bz $f_name
        if [ $? -eq 0 ]; then
            file_size=$(du ${f_name}.tar.bz|awk '{print $1}')
            save_log "$file_size ${f_name}.tar.bz"
            save_log "finish tar ${f_name}.tar.bz"
        else
            file_size=0
            save_log "failed tar ${f_name}.tar.bz"
        fi
        rsync -avzP ${f_name}.tar.bz $backup_dir/$(date +%F)-${f_name}.tar.bz
        rm -f ${f_name}.tar.bz
    done

    # Take an etcd snapshot and archive it
    export ETCDCTL_API=3
    etcdctl --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        snapshot save $backup_dir/$(date +%F)-k8s-snapshot.db
    cd $backup_dir
    tar -cjf $(date +%F)-k8s-snapshot.tar.bz $(date +%F)-k8s-snapshot.db
    if [ $? -eq 0 ]; then
        file_size=$(du $(date +%F)-k8s-snapshot.tar.bz|awk '{print $1}')
        save_log "$file_size $(date +%F)-k8s-snapshot.tar.bz"
        save_log "finish tar $(date +%F)-k8s-snapshot.tar.bz"
    else
        file_size=0
        save_log "failed tar $(date +%F)-k8s-snapshot.tar.bz"
    fi
    rm -f $(date +%F)-k8s-snapshot.db
}

function rsync_backup_files () {
    # Push the archives to the remote backup server (requires passwordless SSH)
    $ssh_command root@${BACK_SERVER} "mkdir -p ${BACK_SERVER_DIR}/${DATE}/"
    rsync -avz --bwlimit=5000 -e "${ssh_command}" $backup_dir/*.bz \
        root@${BACK_SERVER}:${BACK_SERVER_DIR}/${DATE}/
    [ $? -eq 0 ] && save_log "success rsync" || \
        save_log "failed rsync"
}

function delete_old_files () {
    # Remove backups older than each directory's retention window
    for delete_dir_keep_days in ${delete_dirs[@]}; do
        delete_dir=$(echo $delete_dir_keep_days|awk -F':' '{print $1}')
        keep_days=$(echo $delete_dir_keep_days|awk -F':' '{print $2}')
        [ -n "$delete_dir" ] && cd ${delete_dir}
        [ $? -eq 0 ] && find -L ${delete_dir} -mindepth 1 -mtime +$keep_days -exec rm -rf {} \;
    done
}

backup
delete_old_files
#rsync_backup_files
save_log "finish $0\n"
exit 0
```
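The retention sweep in `delete_old_files` boils down to a `find -mtime +N` pass. A standalone sketch of the same pattern, run against a throwaway sandbox directory (the file names here are purely illustrative):

```shell
#!/usr/bin/env bash
# Demonstrate the retention sweep used by delete_old_files in a sandbox.
sandbox=$(mktemp -d)
touch "$sandbox/fresh.tar.bz"                    # backed up today
touch -d "10 days ago" "$sandbox/stale.tar.bz"   # outside the 7-day window

# Same pattern as the script: remove entries older than keep_days
keep_days=7
find -L "$sandbox" -mindepth 1 -mtime +$keep_days -exec rm -rf {} \;

ls "$sandbox"   # only fresh.tar.bz should remain
```

`-mtime +7` matches files whose modification time is strictly more than 7 full days in the past, so today's archive survives while the 10-day-old one is removed.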
5. Restoring etcd data
Note

The restore procedure stops every workload and interrupts all access to the cluster!
First, stop kube-apiserver on each of the three master machines and make sure it has fully stopped.
```
mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
docker ps | grep k8s_   # check whether etcd and the apiserver are still up; wait until all have stopped
mv /var/lib/etcd /var/lib/etcd.bak
```
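The "wait until everything has stopped" step can be automated with a small polling helper. This helper is my own sketch, not part of the original procedure; it simply re-runs a check command until it stops succeeding or a timeout expires:

```shell
#!/usr/bin/env bash
# Poll a check command until it stops succeeding, with a timeout in seconds.
# Usage: wait_until_gone "<command that succeeds while containers remain>" [timeout]
wait_until_gone() {
    local check_cmd=$1 timeout=${2:-300} waited=0
    while eval "$check_cmd"; do
        sleep 1
        waited=$((waited + 1))
        [ "$waited" -ge "$timeout" ] && return 1   # gave up waiting
    done
    return 0   # nothing matches the check anymore
}

# On a master node it would be invoked roughly like:
#   wait_until_gone "docker ps | grep -q 'k8s_\(etcd\|kube-apiserver\)'" 300
```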
Every member of the etcd cluster is restored from the same snapshot.
```
# Prepare the restore file
cd /tmp
tar -jxvf /data/backup/kubernetes/2018-09-18-k8s-snapshot.tar.bz
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.93:/tmp/
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.94:/tmp/
```
Run on lab1:
```
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
    --endpoints=192.168.105.92:2379 \
    --name=lab1 \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --initial-advertise-peer-urls=https://192.168.105.92:2380 \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
    --data-dir=/var/lib/etcd
```
Run on lab2:
```
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
    --endpoints=192.168.105.93:2379 \
    --name=lab2 \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --initial-advertise-peer-urls=https://192.168.105.93:2380 \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
    --data-dir=/var/lib/etcd
```
Run on lab3:
```
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
    --endpoints=192.168.105.94:2379 \
    --name=lab3 \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --initial-advertise-peer-urls=https://192.168.105.94:2380 \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
    --data-dir=/var/lib/etcd
```
Once all three restores are complete, put the manifests back on each master machine.
```
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
```
Final check:
```
# Inspect the keys again
[root@lab1 kubernetes]# etcdctl get / --prefix --keys-only --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
/registry/apiextensions.k8s.io/customresourcedefinitions/apprepositories.kubeapps.com
/registry/apiregistration.k8s.io/apiservices/v1.
/registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/apiregistration.k8s.io/apiservices/v1.authentication.k8s.io
........(output omitted)..........
[root@lab1 kubernetes]# kubectl get pod -n kube-system
NAME                                              READY   STATUS    RESTARTS   AGE
coredns-777d78ff6f-m5chm                          1/1     Running   1          18h
coredns-777d78ff6f-xm7q8                          1/1     Running   1          18h
dashboard-kubernetes-dashboard-7cfc6c7bf5-hr96q   1/1     Running   0          13h
dashboard-kubernetes-dashboard-7cfc6c7bf5-x9p7j   1/1     Running   0          13h
etcd-lab1                                         1/1     Running   0          18h
etcd-lab2                                         1/1     Running   0          1m
etcd-lab3                                         1/1     Running   0          18h
kube-apiserver-lab1                               1/1     Running   0          18h
kube-apiserver-lab2                               1/1     Running   0          1m
kube-apiserver-lab3                               1/1     Running   0          18h
kube-controller-manager-lab1                      1/1     Running   0          18h
kube-controller-manager-lab2                      1/1     Running   0          1m
kube-controller-manager-lab3                      1/1     Running   0          18h
kube-flannel-ds-7w6rl                             1/1     Running   2          18h
kube-flannel-ds-b9pkf                             1/1     Running   2          18h
kube-flannel-ds-fck8t                             1/1     Running   1          18h
kube-flannel-ds-kklxs                             1/1     Running   1          18h
kube-flannel-ds-lxxx9                             1/1     Running   2          18h
kube-flannel-ds-q7lpg                             1/1     Running   1          18h
kube-flannel-ds-tlqqn                             1/1     Running   1          18h
kube-proxy-85j7g                                  1/1     Running   1          18h
kube-proxy-gdvkk                                  1/1     Running   1          18h
kube-proxy-jw5gh                                  1/1     Running   1          18h
kube-proxy-pgfxf                                  1/1     Running   1          18h
kube-proxy-qx62g                                  1/1     Running   1          18h
kube-proxy-rlbdb                                  1/1     Running   1          18h
kube-proxy-whhcv                                  1/1     Running   1          18h
kube-scheduler-lab1                               1/1     Running   0          18h
kube-scheduler-lab2                               1/1     Running   0          1m
kube-scheduler-lab3                               1/1     Running   0          18h
kubernetes-dashboard-754f4d5f69-7npk5             1/1     Running   0          13h
kubernetes-dashboard-754f4d5f69-whtg9             1/1     Running   0          13h
tiller-deploy-98f7f7564-59hcs                     1/1     Running   0          13h
```
I went into the relevant applications to verify; all data was intact.
6. Summary
Whether Kubernetes was installed from binaries or with kubeadm, backing it up essentially comes down to backing up etcd. For the restore, what matters most is the overall order of operations: stop kube-apiserver, stop etcd, restore the data, start etcd, then start kube-apiserver.