The dev team reported that one of their clusters had a master whose filesystem was corrupted and which would no longer boot. Their setup is three VMs on OpenStack, and a fault on the virtualization host had damaged the VM's filesystem. All three machines run as combined master+node. I walked them through the repair and got the machine booting again; the procedure was the same as in my earlier post on an openSUSE rescue.
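For reference, the rescue boils down to booting the VM from a rescue ISO and repairing the root filesystem. A minimal sketch, assuming the CentOS 8 default of XFS and that /dev/vda1 is the damaged root partition (both are assumptions, adjust to your layout):

# from the rescue environment; try a plain repair first
xfs_repair /dev/vda1
# -L zeroes the XFS log; a last resort, only when the log itself is corrupt
# and the plain run refuses to proceed (it can discard the latest writes)
xfs_repair -L /dev/vda1
# for an ext4 root it would instead be: fsck.ext4 -y /dev/vda1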
Once it was up I logged in to take a look. The cluster is HA, so only this node was affected and the cluster as a whole kept running:
[root@k8s-m1 ~]# kubectl get node -o wide
NAME             STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.252.146.104   NotReady   <none>   30d   v1.16.9   10.252.146.104   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.105   Ready      <none>   30d   v1.16.9   10.252.146.105   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.106   Ready      <none>   30d   v1.16.9   10.252.146.106   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
Try starting docker:
[root@k8s-m1 ~]# systemctl start docker
Job for docker.service canceled.
It won't start, so list the failed units:
[root@k8s-m1 ~]# systemctl --failed
  UNIT                LOAD   ACTIVE SUB    DESCRIPTION
● containerd.service  loaded failed failed containerd container runtime
Check containerd's logs:
[root@k8s-m1 ~]# journalctl -xe -u containerd
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481459735+08:00" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481472223+08:00" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481517630+08:00" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481562176+08:00" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481964349+08:00" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481996158+08:00" level=info msg="loading plugin \"io.containerd.internal.v1.restart\"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482048208+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.containers\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482081110+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.content\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482096598+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.diff\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482112263+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.events\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482123307+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.healthcheck\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482133477+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.images\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482142943+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.leases\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482151644+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.namespaces\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482160741+08:00" level=info msg="loading plugin \"io.containerd.internal.v1.opt\"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482184201+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482194643+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482206871+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482215454+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482365838+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482404139+08:00" level=info msg="containerd successfully booted in 0.003611s"
Jul 23 11:20:11 k8s-m1 containerd[9186]: panic: runtime error: invalid memory address or nil pointer dereference
Jul 23 11:20:11 k8s-m1 containerd[9186]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5626b983c259]
Jul 23 11:20:11 k8s-m1 containerd[9186]: goroutine 55 [running]:
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Cursor(...)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:84
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Get(0x0, 0x5626bb7e3f10, 0xb, 0xb, 0x0, 0x2, 0x4)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:260 +0x39
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots.func6(0x7fe557c63020, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x5626b95eec72)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/gc.go:222 +0xcb
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).ForEach(0xc0003d1780, 0xc00057b640, 0xa, 0xa)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:388 +0x100
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots(0x5626bacedde0, 0xc0003d1680, 0xc0002ee2a0, 0xc00031a3c0, 0xc000527a60, 0x7fe586a43fff)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/gc.go:216 +0x4df
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked.func1(0xc0002ee2a0, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:359 +0x165
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*DB).View(0xc00000c1e0, 0xc00008b860, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:701 +0x92
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
Judging by the panic stack, this is very similar to my earlier post on a docker startup panic: in both cases a boltdb file has been corrupted. The GC scheduler goroutine dereferences a nil bucket while scanning the metadata roots, which points at damaged database contents rather than a code bug. Let's use the Git information from the trace to pin down the code path:
[root@k8s-m1 ~]# systemctl cat containerd | grep ExecStart
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
[root@k8s-m1 ~]# /usr/bin/containerd --version
containerd containerd.io 1.2.13 7ad184331fa3e55e52b890ea95e65ba581ae3429
The stack bottoms out in metadata.(*DB).GarbageCollect (db.go:257), and the *DB there wraps the bolt database meta.db, which the metadata plugin opens from filepath.Join(ic.Root, "meta.db") when the server loads its plugins:

https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L257
https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L79
https://github.com/containerd/containerd/blob/v1.2.13/services/server/server.go#L261-L268

Let's find out what ic.Root resolves to:
[root@k8s-m1 ~]# /usr/bin/containerd --help | grep config
     config                     information on the containerd config
   --config value, -c value     path to the configuration file (default: "/etc/containerd/config.toml")
[root@k8s-m1 ~]# grep root /etc/containerd/config.toml
#root = "/var/lib/containerd"
[root@k8s-m1 ~]# find /var/lib/containerd -type f -name meta.db
/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
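Before touching the file, you can optionally confirm that it really is corrupt with the bbolt CLI's consistency check. A sketch, assuming a Go toolchain is available somewhere (copy meta.db off the node if you'd rather not install Go on it):

# install the bbolt CLI shipped with go.etcd.io/bbolt
go get go.etcd.io/bbolt/cmd/bbolt
# walk every page of the database and verify it; a healthy file prints "OK"
bbolt check /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db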
That locates the boltdb file. Move it aside and restart containerd, which will simply create a fresh, empty meta.db on startup:
[root@k8s-m1 ~]# mv /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db{,.bak}
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-07-23 11:20:11 CST; 17min ago
     Docs: https://containerd.io
  Process: 9186 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 9182 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 9186 (code=exited, status=2)

Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]: /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
[root@k8s-m1 ~]# systemctl restart containerd.service
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-07-23 11:25:37 CST; 1s ago
     Docs: https://containerd.io
  Process: 15661 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 15663 (containerd)
    Tasks: 16
   Memory: 28.6M
   CGroup: /system.slice/containerd.service
           └─15663 /usr/bin/containerd

Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496725460+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.images\"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496734129+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.leases\"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496742793+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.namespaces\"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496751740+08:00" level=info msg="loading plugin \"io.containerd.internal.v1.opt\"..." type=io.containerd.internal.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496775185+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496785498+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496794873+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496803178+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496944458+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496958031+08:00" level=info msg="containerd successfully booted in 0.003994s"
With containerd up, start docker. Throwing away containerd's metadata is relatively cheap on a node like this: Docker keeps images and container definitions under /var/lib/docker, so the main loss is containerd's record of previously running tasks.
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─10-docker.conf
   Active: inactive (dead) since Thu 2020-07-23 11:20:13 CST; 18min ago
     Docs: https://docs.docker.com
  Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
  Process: 9187 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 9187 (code=exited, status=0/SUCCESS)

Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956503485+08:00" level=error msg="Stop container error: Failed to stop container 68860c8d16b9ce7e74e8efd9db00e70a57eef1b752c2e6c703073c0bce5517d3 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954347116+08:00" level=error msg="Stop container error: Failed to stop container 5ec9922beed1276989f1866c3fd911f37cc26aae4e4b27c7ce78183a9a4725cc with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.953615411+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956557179+08:00" level=error msg="Stop container error: Failed to stop container 6d0096fbcd4055f8bafb6b38f502a0186cd1dfca34219e9dd6050f512971aef5 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954601191+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956600790+08:00" level=error msg="Stop container error: Failed to stop container 6d1175ba6c55cb05ad89f4134ba8e9d3495c5acb5f07938dc16339b7cca013bf with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957188989+08:00" level=info msg="Daemon shutdown complete"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957212655+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957209679+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
Jul 23 11:20:13 k8s-m1 systemd[1]: Stopped Docker Application Container Engine.
[root@k8s-m1 ~]# systemctl start docker
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─10-docker.conf
   Active: active (running) since Thu 2020-07-23 11:26:11 CST; 1s ago
     Docs: https://docs.docker.com
  Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
  Process: 16156 ExecStartPost=/sbin/iptables -I FORWARD -s 0.0.0.0/0 -j ACCEPT (code=exited, status=0/SUCCESS)
 Main PID: 15974 (dockerd)
    Tasks: 62
   Memory: 89.1M
   CGroup: /system.slice/docker.service
           └─15974 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.851106564+08:00" level=error msg="cb4e16249cd8eac48ed734c71237195f04d63c56c55c0199b3cdf3d49461903d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.860456898+08:00" level=error msg="d9bbcab186ccb59f96c95fc886ec1b66a52aa96e45b117cf7d12e3ff9b95db9f cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.872405757+08:00" level=error msg="07eb7a09bc8589abcb4d79af4b46798327bfb00624a7b9ceea457de392ad8f3d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.877896618+08:00" level=error msg="f5867657025bd7c3951cbd3e08ad97338cf69df2a97967a419e0e78eda869b73 cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.143661583+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.198200760+08:00" level=info msg="Loading containers: done."
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.219959208+08:00" level=info msg="Docker daemon" commit=42e35e61f3 graphdriver(s)=overlay2 version=19.03.11
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.220049865+08:00" level=info msg="Daemon has completed initialization"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.232373131+08:00" level=info msg="API listen on /var/run/docker.sock"
Jul 23 11:26:11 k8s-m1 systemd[1]: Started Docker Application Container Engine.
Docker is back; the "failed to delete container from containerd: no such container" errors are the expected fallout of wiping containerd's metadata, and dockerd simply resyncs. etcd, however, also fails to start, so check its state with journalctl:
[root@k8s-m1 ~]# journalctl -xe -u etcd
Jul 23 11:26:15 k8s-m1 etcd[18129]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:26:15 k8s-m1 etcd[18129]: etcd Version: 3.3.20
Jul 23 11:26:15 k8s-m1 etcd[18129]: Git SHA: 9fd7e2b80
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go Version: go1.12.17
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go OS/Arch: linux/amd64
Jul 23 11:26:15 k8s-m1 etcd[18129]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:26:15 k8s-m1 etcd[18129]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:26:15 k8s-m1 etcd[18129]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring peer auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for peers on https://10.252.146.104:2380
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring client auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: pprof is enabled under /debug/pprof
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 127.0.0.1:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 10.252.146.104:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: skipped unexpected non snapshot file 000000000000002e-000000000052f2be.snap.broken
Jul 23 11:26:15 k8s-m1 etcd[18129]: recovered store from snapshot at index 5426092
Jul 23 11:26:15 k8s-m1 etcd[18129]: restore compact to 3967425
Jul 23 11:26:15 k8s-m1 etcd[18129]: cannot unmarshal event: proto: KeyValue: illegal tag 0 (wire type 0)
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:26:15 k8s-m1 systemd[1]: Failed to start Etcd Service.
[root@k8s-m1 ~]# ll /var/lib/etcd/member/snap/
total 8560
-rw-r--r-- 1 root root   13499 Jul 20 13:36 000000000000002e-000000000052cbac.snap
-rw-r--r-- 2 root root  128360 Jul 20 13:01 000000000000002e-000000000052f2be.snap.broken
-rw------- 1 root root 8617984 Jul 23 11:26 db
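The line "cannot unmarshal event: proto: KeyValue: illegal tag 0 (wire type 0)" means the on-disk data itself is mangled, not just a stray file. If you want to double-check the member's bolt-backed database before giving up on it, etcdctl can inspect it directly (a sketch; the etcdctl binary should match the 3.3 series):

# report size, hash, revision and total keys of the db file;
# structural damage makes this fail loudly
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd/member/snap/db --write-out=table

Either way, this member's data is not coming back on its own, so the next step is restoring from backup.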
This cluster was deployed with my Kubernetes-ansible playbooks (stars welcome), which ship with a backup script. But this node went down three days ago, so its newest local backup stops at July 20:
[root@k8s-m1 ~]# ll /opt/etcd_bak/
total 41524
-rw-r--r-- 1 root root 8618016 Jul 17 02:00 etcd-2020-07-17-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 18 02:00 etcd-2020-07-18-02:00:01.db
-rw-r--r-- 1 root root 8323104 Jul 19 02:00 etcd-2020-07-19-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 20 02:00 etcd-2020-07-20-02:00:01.db
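The backup script is essentially a daily etcdctl v3 snapshot in cron. A sketch of the idea rather than the script verbatim (the endpoint follows the plain-HTTP client URL seen in the logs above):

# 02:00 daily: snapshot the local member
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 \
  snapshot save "/opt/etcd_bak/etcd-$(date +%F-%T).db"
# prune anything older than a week
find /opt/etcd_bak -name 'etcd-*.db' -mtime +7 -delete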
There is also a restore playbook, with one precondition: the etcd v2 and v3 stores must not coexist, otherwise the backup cannot be restored this way (our production clusters run with the v2 store disabled). The heavy lifting is lines 26 to 42 of the restore tasks file, which boil down to roughly the manual steps sketched below.
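A sketch of the equivalent manual restore; the member names etcd-002/etcd-003 and the non-local flag values are assumptions, since only etcd-001 appears in the logs:

systemctl stop etcd
# etcdctl snapshot restore refuses to write into an existing data dir,
# hence the playbook's delete step; this cluster's dedicated WAL dir
# also lives under /var/lib/etcd, so the rm clears it as well
rm -rf /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd_bak/etcd-bak.db \
  --name etcd-001 \
  --data-dir /var/lib/etcd \
  --initial-advertise-peer-urls https://10.252.146.104:2380 \
  --initial-cluster etcd-001=https://10.252.146.104:2380,etcd-002=https://10.252.146.105:2380,etcd-003=https://10.252.146.106:2380
systemctl start etcd

I copied the 07/23 etcd backup file over from one of the healthy masters, pointed the playbook at this host, and ran it: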
[root@k8s-m1 ~]# cd Kubernetes-ansible
[root@k8s-m1 Kubernetes-ansible]# ansible-playbook restoreETCD.yml -e 'db=/opt/etcd_bak/etcd-bak.db'

PLAY [10.252.146.104] **********************************************************

TASK [Gathering Facts] *********************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : 检测备份文件存在否] *************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 停止etcd] ****************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 删除etcd数据目录] ************************************************
ok: [10.252.146.104] => (item=/var/lib/etcd)

TASK [restoreETCD : 分发备份文件] ****************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 恢复备份] ******************************************************
changed: [10.252.146.104]

TASK [restoreETCD : 启动etcd] ****************************************************
fatal: [10.252.146.104]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}

PLAY RECAP *********************************************************************
10.252.146.104             : ok=7    changed=1    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0
The playbook stops at the "启动etcd" (start etcd) task, so look at the logs again:
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:46 k8s-m1 etcd[58954]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:46 k8s-m1 etcd[58954]: etcd Version: 3.3.20
Jul 23 11:27:46 k8s-m1 etcd[58954]: Git SHA: 9fd7e2b80
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go Version: go1.12.17
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go OS/Arch: linux/amd64
Jul 23 11:27:46 k8s-m1 etcd[58954]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:46 k8s-m1 etcd[58954]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring peer auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring client auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: pprof is enabled under /debug/pprof
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:47 k8s-m1 etcd[58954]: member ac2dcf6aed12e8f1 has already been bootstrapped
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:27:47 k8s-m1 systemd[1]: Failed to start Etcd Service.
The telling line is "member ac2dcf6aed12e8f1 has already been bootstrapped": the restored data dir already carries a member identity, so etcd must join the existing cluster rather than bootstrap a new one. The fix is the following change in /etc/etcd/etcd.config.yml; remember to change it back once the member is up.
Change
initial-cluster-state: 'new'
to
initial-cluster-state: 'existing'
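A one-liner for the edit and the later revert, assuming the quoting in the yml matches exactly:

sed -i "s/initial-cluster-state: 'new'/initial-cluster-state: 'existing'/" /etc/etcd/etcd.config.yml
# after the member is healthy, revert:
# sed -i "s/initial-cluster-state: 'existing'/initial-cluster-state: 'new'/" /etc/etcd/etcd.config.yml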
After that it starts cleanly:
[root@k8s-m1 Kubernetes-ansible]# systemctl start etcd
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:55 k8s-m1 etcd[59889]: etcd Version: 3.3.20
Jul 23 11:27:55 k8s-m1 etcd[59889]: Git SHA: 9fd7e2b80
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go Version: go1.12.17
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go OS/Arch: linux/amd64
Jul 23 11:27:55 k8s-m1 etcd[59889]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:55 k8s-m1 etcd[59889]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:27:55 k8s-m1 etcd[59889]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring peer auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring client auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: pprof is enabled under /debug/pprof
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: recovered store from snapshot at index 5952463
Jul 23 11:27:55 k8s-m1 etcd[59889]: restore compact to 4369703
Jul 23 11:27:55 k8s-m1 etcd[59889]: name = etcd-001
Jul 23 11:27:55 k8s-m1 etcd[59889]: data dir = /var/lib/etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: member dir = /var/lib/etcd/member
Jul 23 11:27:55 k8s-m1 etcd[59889]: dedicated WAL dir = /var/lib/etcd/wal
Jul 23 11:27:55 k8s-m1 etcd[59889]: heartbeat = 100ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: election = 1000ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: snapshot count = 5000
Jul 23 11:27:55 k8s-m1 etcd[59889]: advertise client URLs = https://10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: restarting member ac2dcf6aed12e8f1 in cluster 367e2aebc6430cbe at commit index 5952491
Jul 23 11:27:55 k8s-m1 etcd[59889]: ac2dcf6aed12e8f1 became follower at term 47
Jul 23 11:27:55 k8s-m1 etcd[59889]: newRaft ac2dcf6aed12e8f1 [peers: [1e713be314744d53,8b1621b475555fd9,ac2dcf6aed12e8f1], term: 47, commit: 5952491, applied: 5952463, lastindex: 5952491, lastterm: 47]
Jul 23 11:27:55 k8s-m1 etcd[59889]: enabled capabilities for version 3.3
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 1e713be314744d53 [https://10.252.146.105:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 8b1621b475555fd9 [https://10.252.146.106:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member ac2dcf6aed12e8f1 [https://10.252.146.104:2380] to cluster 367e2aebc6430cbe from store
Check the cluster state:
[root@k8s-m1 Kubernetes-ansible]# etcd-ha
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.252.146.104:2379 | ac2dcf6aed12e8f1 | 3.3.20  | 8.3 MB  | false     |        47 |    5953557 |
| https://10.252.146.105:2379 | 1e713be314744d53 | 3.3.20  | 8.6 MB  | false     |        47 |    5953557 |
| https://10.252.146.106:2379 | 8b1621b475555fd9 | 3.3.20  | 8.3 MB  | true      |        47 |    5953557 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
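etcd-ha is just a shell alias from my deployment; underneath it is roughly the following etcdctl call (the client cert paths are an assumption based on the peerTLS paths in the logs; substitute whatever client certs your cluster actually uses):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.252.146.104:2379,https://10.252.146.105:2379,https://10.252.146.106:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  endpoint status --write-out=table

The restored member rejoined as a follower and its RAFT INDEX matches the other two, so it has caught up with the leader.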
Then bring the three control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler) and kubelet back up.
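On this binary (non-static-pod) deployment those are plain systemd units; the service names below assume my ansible layout:

systemctl start kube-apiserver kube-controller-manager kube-scheduler kubelet
systemctl --failed   # confirm nothing is left in a failed state

After that the node reports Ready again: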
[root@k8s-m1 Kubernetes-ansible]# kubectl get node -o wide
NAME             STATUS   ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.252.146.104   Ready    <none>   30d   v1.16.9   10.252.146.104   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.105   Ready    <none>   30d   v1.16.9   10.252.146.105   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.106   Ready    <none>   30d   v1.16.9   10.252.146.106   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
The pods are gradually self-healing as well.
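A quick way to watch the stragglers converge (a hypothetical check, not from the original session):

# anything not yet Running/Completed still needs to settle
kubectl get pod --all-namespaces -o wide | grep -Ev 'Running|Completed'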