记录一次线上k8s节点故障

内容简介：说明服务挂掉过一次,登陆到机器上发现集群有台节点状态是nodelost状态上去看到相关服务都挂掉了然后排查到根分区占满了,排查到是k8s日志堆满了/var/log/

邮件收到zabbix的告警,业务的网页登陆状态不是200,后面又自愈了

说明服务挂掉过一次,登陆到机器上发现集群有台节点状态是nodelost状态

上去看到相关服务都挂掉了

然后排查到根分区占满了,排查到是k8s日志堆满了/var/log/

[root@cloudos02 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  216G     0 100% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0
[root@cloudos02 /var/log/]# du -shx /var/log/* | grep -P '^\S+?G'
31G   /var/log/heat
1.1G  /var/log/keystone
163G  /var/log/kubernetes
3.7G  /var/log/nova
1.7G  /var/log/openstack-compute

日志文件名有规律,直接删掉20天之前的日志文件

[root@cloudos02 kubernetes]# find -mtime +20 -name 'kube*.cloudos02*' -exec rm -f {} \;
[root@cloudos02 kubernetes]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  117G   91G  57% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0

k8s核心是etcd,果然是etcd有问题,k8s相关服务全部挂了

[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
failed to check the health of member 658a31702f200e95 on http://10.12.0.22:2379: Get http://10.12.0.22:2379/health: dial tcp 10.12.0.22:2379: getsockopt: connection refused
member 658a31702f200e95 is unreachable: [http://10.12.0.22:2379] are all unreachable
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
cluster is healthy

查看日志

[root@cloudos02 ~]# journalctl -xe -u etcd2
一大堆输出说snap.broken

通过日志可以确定etcd的文件损坏了,肯定是由于根分区满了同步过来的数据无法写入导致损坏

先查找etcd的数据目录在哪,解决方法就是删掉此台的数据目录,然后再同步过来就行了

由于是实体服务,直接去找systemd脚本

[root@cloudos02 ~]# cat /usr/lib/systemd/system/etcd2.service 
[Unit]
Description=Etcd2 Server

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/kube-etcd-cluster
ExecStart=/opt/bin/etcd --name=${ETCD_NAME} ......省略

从/usr/lib/systemd/system/etcd2.service看到没有写数据目录,那么默认数据目录是默认为 ${name}.etcd

etcd2.service里的name是从/etc/sysconfig/kube-etcd-cluster里读取变量

[root@cloudos02 ~]# cat /etc/sysconfig/kube-etcd-cluster
ETCD_NAME="NODE2"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.12.0.22:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.12.0.22:2379"
ETCD_INITIAL_CLUSTER_TOKEN="my-etcd-cluster"
ETCD_INITIAL_CLUSTER="NODE1=http://10.12.0.21:2380,NODE2=http://10.12.0.22:2380,NODE3=http://10.12.0.23:2380"
ETCD_INITIAL_CLUSTER_STATE="new"

根目录确实有NODE2.etcd,删掉数据目录

[root@cloudos02 ~]# ll /
drwx------    3 root root  4096 Jun 29 11:47 NODE2.etcd
[root@cloudos02 ~]# rm -rf /NODE2.etcd

去另外正常的节点上移除这个节点,然后再加上

[root@cloudos01 ~]# /opt/bin/etcdctl member remove 658a31702f200e95
Removed member 658a31702f200e95 from cluster
[root@cloudos01 ~]# /opt/bin/etcdctl member add NODE2 http://10.12.0.22:2380

然后去异常节点上修改配置文件/etc/sysconfig/kube-etcd-cluster

将ETCD_INITIAL_CLUSTER_STATE=new，修改为ETCD_INITIAL_CLUSTER_STATE=existing并启动etcd

[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#new#existing#' /etc/sysconfig/kube-etcd-cluster
[root@cloudos02 ~]# systemctl start etcd2

查看集群成员状态

[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
member f95341f81eb9322c is healthy: got healthy result from http://10.12.0.22:2379
cluster is healthy

然后去异常节点上修改配置文件/etc/sysconfig/kube-etcd-cluster

将ETCD_INITIAL_CLUSTER_STATE=existing改回new

[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#existing#new#' /etc/sysconfig/kube-etcd-cluster

后面启动相关服务节点完全正常

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

Parsing Techniques

Dick Grune、Ceriel J.H. Jacobs / Springer / 2010-2-12 / USD 109.00

This second edition of Grune and Jacobs' brilliant work presents new developments and discoveries that have been made in the field. Parsing, also referred to as syntax analysis, has been and continues......一起来看看《Parsing Techniques》这本书的介绍吧!

码农工具