Notes on a production k8s node failure

Category: Servers · Linux · Published: 5 years ago

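  • An email alert from zabbix reported that the business site's login page was not returning HTTP 200; it later recovered on its own

This meant the service had gone down at least once. After logging in to the machines, I found one node in the cluster in the NodeLost state.

On that node, all of the related services were down.

Further investigation showed that the root partition was full, and that the cause was k8s logs piling up under /var/log/.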

[root@cloudos02 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  216G     0 100% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0
[root@cloudos02 /var/log/]# du -shx /var/log/* | grep -P '^\S+?G'
31G   /var/log/heat
1.1G  /var/log/keystone
163G  /var/log/kubernetes
3.7G  /var/log/nova
1.7G  /var/log/openstack-compute
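The grep above keeps only entries whose size is reported in gigabytes. A minimal alternative sketch, assuming GNU du and sort, is to rank the entries by size instead, so the worst offender is always on the first line whatever its unit. The function name largest_entries is hypothetical and not part of the original session.

```shell
# Print the N largest entries directly under a directory, biggest first.
# -x keeps du on a single filesystem, -s summarises each entry,
# -k reports sizes in KiB so that a plain numeric sort is correct.
# The function name is illustrative only.
largest_entries() {
    dir="${1:-/var/log}"
    count="${2:-5}"
    du -xsk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

Called as `largest_entries /var/log 5`, it prints size and path pairs, which makes it easy to see which subdirectory dominates before deleting anything.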
  • The log file names follow a pattern, so I simply deleted the log files older than 20 days
[root@cloudos02 kubernetes]# find -mtime +20 -name 'kube*.cloudos02*' -exec rm -f {} \;
[root@cloudos02 kubernetes]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  117G   91G  57% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0
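Deleting with find is irreversible, so it can help to list the matches first and delete only after reviewing them. The sketch below wraps the same find expression used above in a function that defaults to a dry run; the function name purge_old_logs and its "apply" argument are hypothetical, and -delete assumes GNU find.

```shell
# List (default) or delete (mode "apply") regular files under a directory
# that match a name pattern and were modified more than N days ago.
# The function name and the "apply" switch are illustrative only.
purge_old_logs() {
    dir="$1"
    days="$2"
    pattern="$3"
    mode="${4:-dry-run}"
    if [ "$mode" = "apply" ]; then
        find "$dir" -type f -mtime "+$days" -name "$pattern" -delete
    else
        find "$dir" -type f -mtime "+$days" -name "$pattern" -print
    fi
}
```

For the case in this post that would be `purge_old_logs /var/log/kubernetes 20 'kube*.cloudos02*'` to preview, then the same call with `apply` appended. Restricting the match to `-type f` avoids touching directories that happen to match the pattern.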
  • etcd is the core of k8s, and sure enough etcd was in trouble, which is why all of the k8s-related services were down
[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
failed to check the health of member 658a31702f200e95 on http://10.12.0.22:2379: Get http://10.12.0.22:2379/health: dial tcp 10.12.0.22:2379: getsockopt: connection refused
member 658a31702f200e95 is unreachable: [http://10.12.0.22:2379] are all unreachable
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
cluster is healthy
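Note that the last line still says "cluster is healthy": a three-member etcd cluster keeps quorum with two members, so the summary line alone hides the failed node. A small sketch that pulls the unreachable member IDs out of this output is shown here; it parses the text format shown in the transcript above, and the function name unreachable_members is hypothetical.

```shell
# Read `etcdctl cluster-health` output on stdin and print the ID of every
# member reported as unreachable. Only lines of the form
# "member <id> is unreachable: ..." are matched.
# The function name is illustrative only.
unreachable_members() {
    awk '$1 == "member" && /is unreachable/ { print $2 }'
}
```

Usage would be `/opt/bin/etcdctl cluster-health | unreachable_members`; empty output means every member answered.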
  • Check the logs
[root@cloudos02 ~]# journalctl -xe -u etcd2
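(a large amount of output mentioning snap.broken)

The logs confirm that etcd's files were corrupted, almost certainly because the root partition was full and the data replicated from the other members could not be written.

The first step is to find etcd's data directory. The fix is to delete the data directory on this node and let it resync from the rest of the cluster.

Since etcd runs here as a host service rather than in a container, go straight to its systemd unit file.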

[root@cloudos02 ~]# cat /usr/lib/systemd/system/etcd2.service 
[Unit]
Description=Etcd2 Server

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/kube-etcd-cluster
ExecStart=/opt/bin/etcd --name=${ETCD_NAME} ......(rest omitted)

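/usr/lib/systemd/system/etcd2.service does not set a data directory, so etcd falls back to its default, ${name}.etcd, created relative to the process's working directory (for a systemd system service that is / unless WorkingDirectory= is set).

The name used in etcd2.service is read from a variable in /etc/sysconfig/kube-etcd-cluster.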

[root@cloudos02 ~]# cat /etc/sysconfig/kube-etcd-cluster
ETCD_NAME="NODE2"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.12.0.22:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.12.0.22:2379"
ETCD_INITIAL_CLUSTER_TOKEN="my-etcd-cluster"
ETCD_INITIAL_CLUSTER="NODE1=http://10.12.0.21:2380,NODE2=http://10.12.0.22:2380,NODE3=http://10.12.0.23:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
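The lookup just described can be sketched as a small function: check the unit file for an explicit --data-dir flag, then the environment file for ETCD_DATA_DIR, and otherwise fall back to etcd's default of ${name}.etcd. This is a rough sketch under simple assumptions (one setting per line, no spaces in the path); the function name etcd_data_dir is hypothetical, and the result of the fallback branch is relative to etcd's working directory.

```shell
# Work out which data directory etcd will use, given its systemd unit file
# and its environment file. Order: --data-dir flag, ETCD_DATA_DIR variable,
# then the default "<name>.etcd" (name itself defaults to "default").
# The function name is illustrative only.
etcd_data_dir() {
    unit_file="$1"
    env_file="$2"
    # 1. An explicit --data-dir flag on the command line wins.
    dir=$(sed -n 's/.*--data-dir[= ]\([^ ]*\).*/\1/p' "$unit_file" | head -n 1)
    # 2. Otherwise look for ETCD_DATA_DIR in the environment file.
    if [ -z "$dir" ]; then
        dir=$(sed -n 's/^ETCD_DATA_DIR=//p' "$env_file" | tr -d '"' | head -n 1)
    fi
    # 3. Otherwise use the default derived from the member name.
    if [ -z "$dir" ]; then
        name=$(sed -n 's/^ETCD_NAME=//p' "$env_file" | tr -d '"' | head -n 1)
        dir="${name:-default}.etcd"
    fi
    printf '%s\n' "$dir"
}
```

With the two files shown above, `etcd_data_dir /usr/lib/systemd/system/etcd2.service /etc/sysconfig/kube-etcd-cluster` would take the third branch, since neither file sets a data directory in the parts quoted here.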

The root directory does indeed contain NODE2.etcd, so delete that data directory.

[root@cloudos02 ~]# ll /
drwx------    3 root root  4096 Jun 29 11:47 NODE2.etcd
[root@cloudos02 ~]# rm -rf /NODE2.etcd

On one of the healthy nodes, remove this member from the cluster and then add it back.

[root@cloudos01 ~]# /opt/bin/etcdctl member remove 658a31702f200e95
Removed member 658a31702f200e95 from cluster
[root@cloudos01 ~]# /opt/bin/etcdctl member add NODE2 http://10.12.0.22:2380

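Then, on the faulty node, edit the configuration file /etc/sysconfig/kube-etcd-cluster.

Change ETCD_INITIAL_CLUSTER_STATE=new to ETCD_INITIAL_CLUSTER_STATE=existing, so that the node joins the existing cluster instead of bootstrapping a new one, and start etcd.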

[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#new#existing#' /etc/sysconfig/kube-etcd-cluster
[root@cloudos02 ~]# systemctl start etcd2
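The sed above replaces the word "new" anywhere on the matching line, so it does nothing if the value is already something else. A slightly stricter sketch rewrites the whole assignment and rejects unexpected values; it assumes GNU sed (for -i) and the quoted KEY="value" style used in this file, and the function name set_cluster_state is hypothetical.

```shell
# Set ETCD_INITIAL_CLUSTER_STATE in an etcd environment file to either
# "new" or "existing", rewriting the whole line so the call is idempotent.
# The function name is illustrative only.
set_cluster_state() {
    env_file="$1"
    state="$2"
    case "$state" in
        new|existing) ;;
        *) echo "state must be 'new' or 'existing'" >&2; return 1 ;;
    esac
    sed -i "s/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE=\"$state\"/" "$env_file"
}
```

For this node that would be `set_cluster_state /etc/sysconfig/kube-etcd-cluster existing` before starting etcd2.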

Check the health of the cluster members

[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
member f95341f81eb9322c is healthy: got healthy result from http://10.12.0.22:2379
cluster is healthy

Finally, on the recovered node, edit /etc/sysconfig/kube-etcd-cluster once more.

Change ETCD_INITIAL_CLUSTER_STATE=existing back to new, restoring the file to its original contents.

[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#existing#new#' /etc/sysconfig/kube-etcd-cluster

After that, I started the related services and the node was completely back to normal.

