The dev team reported that one of their clusters had a master whose filesystem was corrupted and would no longer boot. The cluster runs on three OpenStack VMs, and a hypervisor failure corrupted the guest filesystem. All three machines are combined master+node. I walked them through the repair and got the machine booting again; the repair itself was the same as in my earlier article on an openSUSE rescue.
Once it was back up I logged in to take a look. Since the cluster is HA, only this one node was affected and the cluster as a whole was fine:
[root@k8s-m1 ~]# kubectl get node -o wide
NAME             STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.252.146.104   NotReady   <none>   30d   v1.16.9   10.252.146.104   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.105   Ready      <none>   30d   v1.16.9   10.252.146.105   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.106   Ready      <none>   30d   v1.16.9   10.252.146.106   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
Try starting docker:
[root@k8s-m1 ~]# systemctl start docker
Job for docker.service canceled.
It will not start, so list the failed units:
[root@k8s-m1 ~]# systemctl --failed
  UNIT                LOAD   ACTIVE SUB    DESCRIPTION
● containerd.service  loaded failed failed containerd container runtime
Check containerd's logs:
[root@k8s-m1 ~]# journalctl -xe -u containerd
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481459735+08:00" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481472223+08:00" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481517630+08:00" level=info msg="loading plugin "io.containerd.runtime.v2.task"..." type=io.containerd.runtime.v2
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481562176+08:00" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481964349+08:00" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481996158+08:00" level=info msg="loading plugin "io.containerd.internal.v1.restart"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482048208+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482081110+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482096598+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482112263+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482123307+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482133477+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482142943+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482151644+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482160741+08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482184201+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482194643+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482206871+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482215454+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482365838+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482404139+08:00" level=info msg="containerd successfully booted in 0.003611s"
Jul 23 11:20:11 k8s-m1 containerd[9186]: panic: runtime error: invalid memory address or nil pointer dereference
Jul 23 11:20:11 k8s-m1 containerd[9186]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5626b983c259]
Jul 23 11:20:11 k8s-m1 containerd[9186]: goroutine 55 [running]:
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Cursor(...)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:84
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Get(0x0, 0x5626bb7e3f10, 0xb, 0xb, 0x0, 0x2, 0x4)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:260 +0x39
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots.func6(0x7fe557c63020, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x5626b95eec72)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/gc.go:222 +0xcb
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).ForEach(0xc0003d1780, 0xc00057b640, 0xa, 0xa)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:388 +0x100
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots(0x5626bacedde0, 0xc0003d1680, 0xc0002ee2a0, 0xc00031a3c0, 0xc000527a60, 0x7fe586a43fff)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/gc.go:216 +0x4df
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked.func1(0xc0002ee2a0, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:359 +0x165
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*DB).View(0xc00000c1e0, 0xc00008b860, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:701 +0x92
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
Judging from the stack trace of the panic, this looks very much like the docker startup panic from my earlier article: in both cases a boltdb file is corrupted. Let's get the version/commit info and trace where that code path lives:
[root@k8s-m1 ~]# systemctl cat containerd | grep ExecStart
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
[root@k8s-m1 ~]# /usr/bin/containerd --version
containerd containerd.io 1.2.13 7ad184331fa3e55e52b890ea95e65ba581ae3429
Looking at the code for that version, the GarbageCollect panic happens while reading the metadata bolt database meta.db; the relevant spots in v1.2.13:
https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L257
https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L79
https://github.com/containerd/containerd/blob/v1.2.13/services/server/server.go#L261-L268
Find out what the ic.Root path actually is:
[root@k8s-m1 ~]# /usr/bin/containerd --help | grep config
   config               information on the containerd config
   --config value, -c value   path to the configuration file (default: "/etc/containerd/config.toml")
[root@k8s-m1 ~]# grep root /etc/containerd/config.toml
#root = "/var/lib/containerd"
[root@k8s-m1 ~]# find /var/lib/containerd -type f -name meta.db
/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
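Before touching the file, you can optionally confirm that it really is a corrupted bolt database. This is only a sketch and not a step from the original repair: it assumes Go is available somewhere to install the bbolt CLI, and containerd has to stay stopped while you poke at the file.

# optional sanity check with the bbolt CLI (assumes Go is available)
go install go.etcd.io/bbolt/cmd/bbolt@latest
# run the bolt consistency check against containerd's metadata DB
bbolt check /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
# a healthy file prints "OK"; a damaged one reports page-level errors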
Found the boltdb file. Move it out of the way and restart containerd:
[root@k8s-m1 ~]# mv /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db{,.bak}
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-07-23 11:20:11 CST; 17min ago
     Docs: https://containerd.io
  Process: 9186 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 9182 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 9186 (code=exited, status=2)
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
[root@k8s-m1 ~]# systemctl restart containerd.service
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-07-23 11:25:37 CST; 1s ago
     Docs: https://containerd.io
  Process: 15661 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 15663 (containerd)
    Tasks: 16
   Memory: 28.6M
   CGroup: /system.slice/containerd.service
           └─15663 /usr/bin/containerd
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496725460+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496734129+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496742793+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496751740+08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496775185+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496785498+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496794873+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496803178+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496944458+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496958031+08:00" level=info msg="containerd successfully booted in 0.003994s"
With containerd back up, look at docker and start it:
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─10-docker.conf
   Active: inactive (dead) since Thu 2020-07-23 11:20:13 CST; 18min ago
     Docs: https://docs.docker.com
  Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
  Process: 9187 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 9187 (code=exited, status=0/SUCCESS)
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956503485+08:00" level=error msg="Stop container error: Failed to stop container 68860c8d16b9ce7e74e8efd9db00e70a57eef1b752c2e6c703073c0bce5517d3 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954347116+08:00" level=error msg="Stop container error: Failed to stop container 5ec9922beed1276989f1866c3fd911f37cc26aae4e4b27c7ce78183a9a4725cc with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.953615411+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956557179+08:00" level=error msg="Stop container error: Failed to stop container 6d0096fbcd4055f8bafb6b38f502a0186cd1dfca34219e9dd6050f512971aef5 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954601191+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956600790+08:00" level=error msg="Stop container error: Failed to stop container 6d1175ba6c55cb05ad89f4134ba8e9d3495c5acb5f07938dc16339b7cca013bf with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957188989+08:00" level=info msg="Daemon shutdown complete"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957212655+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957209679+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
Jul 23 11:20:13 k8s-m1 systemd[1]: Stopped Docker Application Container Engine.
[root@k8s-m1 ~]# systemctl start docker
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─10-docker.conf
   Active: active (running) since Thu 2020-07-23 11:26:11 CST; 1s ago
     Docs: https://docs.docker.com
  Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
  Process: 16156 ExecStartPost=/sbin/iptables -I FORWARD -s 0.0.0.0/0 -j ACCEPT (code=exited, status=0/SUCCESS)
 Main PID: 15974 (dockerd)
    Tasks: 62
   Memory: 89.1M
   CGroup: /system.slice/docker.service
           └─15974 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.851106564+08:00" level=error msg="cb4e16249cd8eac48ed734c71237195f04d63c56c55c0199b3cdf3d49461903d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.860456898+08:00" level=error msg="d9bbcab186ccb59f96c95fc886ec1b66a52aa96e45b117cf7d12e3ff9b95db9f cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.872405757+08:00" level=error msg="07eb7a09bc8589abcb4d79af4b46798327bfb00624a7b9ceea457de392ad8f3d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.877896618+08:00" level=error msg="f5867657025bd7c3951cbd3e08ad97338cf69df2a97967a419e0e78eda869b73 cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.143661583+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.198200760+08:00" level=info msg="Loading containers: done."
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.219959208+08:00" level=info msg="Docker daemon" commit=42e35e61f3 graphdriver(s)=overlay2 version=19.03.11
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.220049865+08:00" level=info msg="Daemon has completed initialization"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.232373131+08:00" level=info msg="API listen on /var/run/docker.sock"
Jul 23 11:26:11 k8s-m1 systemd[1]: Started Docker Application Container Engine.
etcd also failed to start; check its status with journalctl:
[root@k8s-m1 ~]# journalctl -xe -u etcd
Jul 23 11:26:15 k8s-m1 etcd[18129]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:26:15 k8s-m1 etcd[18129]: etcd Version: 3.3.20
Jul 23 11:26:15 k8s-m1 etcd[18129]: Git SHA: 9fd7e2b80
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go Version: go1.12.17
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go OS/Arch: linux/amd64
Jul 23 11:26:15 k8s-m1 etcd[18129]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:26:15 k8s-m1 etcd[18129]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:26:15 k8s-m1 etcd[18129]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring peer auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for peers on https://10.252.146.104:2380
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring client auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: pprof is enabled under /debug/pprof
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 127.0.0.1:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 10.252.146.104:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: skipped unexpected non snapshot file 000000000000002e-000000000052f2be.snap.broken
Jul 23 11:26:15 k8s-m1 etcd[18129]: recovered store from snapshot at index 5426092
Jul 23 11:26:15 k8s-m1 etcd[18129]: restore compact to 3967425
Jul 23 11:26:15 k8s-m1 etcd[18129]: cannot unmarshal event: proto: KeyValue: illegal tag 0 (wire type 0)
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:26:15 k8s-m1 systemd[1]: Failed to start Etcd Service.
[root@k8s-m1 ~]# ll /var/lib/etcd/member/snap/
total 8560
-rw-r--r-- 1 root root   13499 Jul 20 13:36 000000000000002e-000000000052cbac.snap
-rw-r--r-- 2 root root  128360 Jul 20 13:01 000000000000002e-000000000052f2be.snap.broken
-rw------- 1 root root 8617984 Jul 23 11:26 db
This cluster was deployed with my Kubernetes ansible playbooks (stars welcome), which come with an etcd backup script. The node broke three days ago, though, so the newest backup left on this machine is from Jul 20:
[root@k8s-m1 ~]# ll /opt/etcd_bak/
total 41524
-rw-r--r-- 1 root root 8618016 Jul 17 02:00 etcd-2020-07-17-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 18 02:00 etcd-2020-07-18-02:00:01.db
-rw-r--r-- 1 root root 8323104 Jul 19 02:00 etcd-2020-07-19-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 20 02:00 etcd-2020-07-20-02:00:01.db
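The backup script itself lives in the repo and is not reproduced here, but conceptually a daily v3 backup is nothing more than an etcdctl snapshot save on a cron schedule. A minimal sketch: the output path and file naming are inferred from the listing above, and the plain-HTTP localhost endpoint comes from the journal output earlier, so no client certs are needed for it.

# minimal daily etcd v3 backup sketch (not the exact script from the repo)
export ETCDCTL_API=3
etcdctl --endpoints=http://127.0.0.1:2379 \
  snapshot save "/opt/etcd_bak/etcd-$(date +%Y-%m-%d-%H:%M:%S).db"
# prune backups older than a week
find /opt/etcd_bak -name 'etcd-*.db' -mtime +7 -delete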
There is also a restore playbook, with one precondition: the etcd v2 and v3 stores must not be mixed, otherwise the backup cannot be restored this way; in our production clusters the v2 store is disabled. The interesting part is lines 26-42 of that tasks file. Here I copied the 07/23 etcd backup file over from one of the other masters, adjusted the host, and ran the playbook; a rough manual equivalent of those steps is sketched below, followed by the actual run.
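For reference only, this is roughly what a v3-only single-member restore looks like when done by hand with etcdctl. It is a sketch of the general technique, not a copy of the playbook: the data dir and backup path come from the output above, the member name etcd-001 for this node is taken from etcd's own startup log further down, and etcd-002/etcd-003 are placeholders that would have to be replaced with the real member names from /etc/etcd/etcd.config.yml.

# sketch of a manual v3 restore on this master; NOT the playbook itself
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd.broken.$(date +%F)   # keep the damaged data dir around

ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd_bak/etcd-bak.db \
  --name etcd-001 \
  --data-dir /var/lib/etcd \
  --initial-advertise-peer-urls https://10.252.146.104:2380 \
  --initial-cluster 'etcd-001=https://10.252.146.104:2380,etcd-002=https://10.252.146.105:2380,etcd-003=https://10.252.146.106:2380'
# this cluster uses a dedicated WAL dir (/var/lib/etcd/wal); check
# `etcdctl snapshot restore --help` on your version for the matching wal-dir option

systemctl start etcd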
[root@k8s-m1 ~]# cd Kubernetes-ansible
[root@k8s-m1 Kubernetes-ansible]# ansible-playbook restoreETCD.yml -e 'db=/opt/etcd_bak/etcd-bak.db'

PLAY [10.252.146.104] **********************************************************

TASK [Gathering Facts] *********************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : 检测备份文件存在否] *************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 停止etcd] ****************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 删除etcd数据目录] ************************************************
ok: [10.252.146.104] => (item=/var/lib/etcd)

TASK [restoreETCD : 分发备份文件] ****************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 恢复备份] ******************************************************
changed: [10.252.146.104]

TASK [restoreETCD : 启动etcd] ****************************************************
fatal: [10.252.146.104]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}

PLAY RECAP *********************************************************************
10.252.146.104 : ok=7 changed=1 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
Check the logs:
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:46 k8s-m1 etcd[58954]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:46 k8s-m1 etcd[58954]: etcd Version: 3.3.20
Jul 23 11:27:46 k8s-m1 etcd[58954]: Git SHA: 9fd7e2b80
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go Version: go1.12.17
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go OS/Arch: linux/amd64
Jul 23 11:27:46 k8s-m1 etcd[58954]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:46 k8s-m1 etcd[58954]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring peer auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring client auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: pprof is enabled under /debug/pprof
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:47 k8s-m1 etcd[58954]: member ac2dcf6aed12e8f1 has already been bootstrapped
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:27:47 k8s-m1 systemd[1]: Failed to start Etcd Service.
The key error is "member xxx has already been bootstrapped". The fix is to change the following setting in the etcd config file, and remember to change it back once etcd is up again:

initial-cluster-state: 'new'  ->  initial-cluster-state: 'existing'
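On this deployment the config file is /etc/etcd/etcd.config.yml (the path shows up in the journal above). A small sketch of the edit; the exact quoting of the value is an assumption, so check it with grep before blindly running sed:

# confirm how the key is written in this config
grep initial-cluster-state /etc/etcd/etcd.config.yml
# flip it to 'existing' for the recovery start
sed -i "s/initial-cluster-state: 'new'/initial-cluster-state: 'existing'/" /etc/etcd/etcd.config.yml
# ...start etcd, and once the member has rejoined, change it back the same way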
Then etcd starts successfully:
[root@k8s-m1 Kubernetes-ansible]# systemctl start etcd
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:55 k8s-m1 etcd[59889]: etcd Version: 3.3.20
Jul 23 11:27:55 k8s-m1 etcd[59889]: Git SHA: 9fd7e2b80
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go Version: go1.12.17
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go OS/Arch: linux/amd64
Jul 23 11:27:55 k8s-m1 etcd[59889]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:55 k8s-m1 etcd[59889]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:27:55 k8s-m1 etcd[59889]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring peer auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring client auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: pprof is enabled under /debug/pprof
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: recovered store from snapshot at index 5952463
Jul 23 11:27:55 k8s-m1 etcd[59889]: restore compact to 4369703
Jul 23 11:27:55 k8s-m1 etcd[59889]: name = etcd-001
Jul 23 11:27:55 k8s-m1 etcd[59889]: data dir = /var/lib/etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: member dir = /var/lib/etcd/member
Jul 23 11:27:55 k8s-m1 etcd[59889]: dedicated WAL dir = /var/lib/etcd/wal
Jul 23 11:27:55 k8s-m1 etcd[59889]: heartbeat = 100ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: election = 1000ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: snapshot count = 5000
Jul 23 11:27:55 k8s-m1 etcd[59889]: advertise client URLs = https://10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: restarting member ac2dcf6aed12e8f1 in cluster 367e2aebc6430cbe at commit index 5952491
Jul 23 11:27:55 k8s-m1 etcd[59889]: ac2dcf6aed12e8f1 became follower at term 47
Jul 23 11:27:55 k8s-m1 etcd[59889]: newRaft ac2dcf6aed12e8f1 [peers: [1e713be314744d53,8b1621b475555fd9,ac2dcf6aed12e8f1], term: 47, commit: 5952491, applied: 5952463, lastindex: 5952491, lastterm: 47]
Jul 23 11:27:55 k8s-m1 etcd[59889]: enabled capabilities for version 3.3
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 1e713be314744d53 [https://10.252.146.105:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 8b1621b475555fd9 [https://10.252.146.106:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member ac2dcf6aed12e8f1 [https://10.252.146.104:2380] to cluster 367e2aebc6430cbe from store
Check the cluster status:
[root@k8s-m1 Kubernetes-ansible]# etcd-ha
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.252.146.104:2379 | ac2dcf6aed12e8f1 |  3.3.20 |  8.3 MB |     false |        47 |    5953557 |
| https://10.252.146.105:2379 | 1e713be314744d53 |  3.3.20 |  8.6 MB |     false |        47 |    5953557 |
| https://10.252.146.106:2379 | 8b1621b475555fd9 |  3.3.20 |  8.3 MB |      true |        47 |    5953557 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
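etcd-ha here is just a convenience alias shipped with the ansible deployment, not a stock binary. The same table comes out of a plain etcdctl endpoint status; a sketch that assumes the peer certs double as client certs (adjust the cert paths to whatever your deployment actually uses):

export ETCDCTL_API=3
etcdctl \
  --endpoints=https://10.252.146.104:2379,https://10.252.146.105:2379,https://10.252.146.106:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  endpoint status -w table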
Then bring the three control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler) and the kubelet back up.
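On a binary/systemd deployment like this one that is just a handful of systemctl calls; the unit names below are assumptions based on the usual layout of such ansible roles, so verify them first:

# unit names are assumed; verify with: systemctl list-unit-files | grep -E 'kube|etcd'
systemctl start kube-apiserver kube-controller-manager kube-scheduler
systemctl start kubelet kube-proxy
systemctl --failed    # make sure nothing is left in a failed state

Once they are running, the node reports Ready again: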
[root@k8s-m1 Kubernetes-ansible]# kubectl get node -o wide
NAME             STATUS   ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.252.146.104   Ready    <none>   30d   v1.16.9   10.252.146.104   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.105   Ready    <none>   30d   v1.16.9   10.252.146.105   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.106   Ready    <none>   30d   v1.16.9   10.252.146.106   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
The pods are slowly self-healing as well.
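A quick way to watch that recovery, using nothing beyond standard kubectl:

# list anything that is not yet back to Running/Completed
kubectl get pods --all-namespaces -o wide | grep -Ev 'Running|Completed'
# or keep an eye on just the pods scheduled on the recovered node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=10.252.146.104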