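- A Zabbix alert arrived by email: the login page of the business site was returning a status other than 200, and then it recovered by itself
That meant the service had gone down at least once. After logging in to the machines, I found one node in the cluster in the NodeLost state
On that node, all the related services were down
Further digging showed the root partition was full, and the culprit was k8s logs piling up under /var/log/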
[root@cloudos02 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  216G     0 100% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0
[root@cloudos02 /var/log/]# du -shx /var/log/* | grep -P '^\S+?G'
31G     /var/log/heat
1.1G    /var/log/keystone
163G    /var/log/kubernetes
3.7G    /var/log/nova
1.7G    /var/log/openstack-compute
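A full root partition like this is easier to catch before it takes services down. The following is a minimal sketch of a usage check, not something from the original setup: the function names `usage_pct` and `is_nearly_full` and the 85% default threshold are assumptions for illustration.

```shell
# Print the used percentage (integer, without the % sign) of the
# filesystem that holds the path given as $1.
# `df -P` forces the POSIX layout (one line per filesystem), so the
# percentage is the fifth field of the second line.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage is at or above the threshold, non-zero otherwise.
# The default of 85 is an assumed value, not one taken from the incident.
is_nearly_full() {
    [ "$(usage_pct "$1")" -ge "${2:-85}" ]
}

if is_nearly_full / 85; then
    echo "WARNING: / is $(usage_pct /)% full"
fi
```

Run from cron or wired into the existing monitoring, a check along these lines would have flagged the partition well before it reached 100%.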
- The log file names follow a regular pattern, so I simply deleted the log files older than 20 days
[root@cloudos02 kubernetes]# find -mtime +20 -name 'kube*.cloudos02*' -exec rm -f {} \;
[root@cloudos02 kubernetes]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  117G   91G  57% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0
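Since a `find ... rm` one-liner is unforgiving, it can help to preview the matches first. Below is a small sketch of that idea; the function name `prune_old_logs` and its `--delete` switch are made up for illustration, and it assumes GNU find. It also adds `-type f` so that only regular files are touched.

```shell
# prune_old_logs DIR PATTERN DAYS [--delete]
# Without --delete the matching files are only listed (dry run).
# With --delete they are removed.
prune_old_logs() {
    if [ "${4:-}" = "--delete" ]; then
        action="-delete"
    else
        action="-print"
    fi
    # -mtime +N matches files last modified more than N whole days ago.
    find "$1" -type f -name "$2" -mtime +"$3" "$action"
}
```

For the case above, `prune_old_logs /var/log/kubernetes 'kube*.cloudos02*' 20` would list the candidates, and appending `--delete` would remove them.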
- etcd is the core of k8s, and sure enough etcd was the problem: every k8s-related service on the node was down
[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
failed to check the health of member 658a31702f200e95 on http://10.12.0.22:2379: Get http://10.12.0.22:2379/health: dial tcp 10.12.0.22:2379: getsockopt: connection refused
member 658a31702f200e95 is unreachable: [http://10.12.0.22:2379] are all unreachable
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
cluster is healthy
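The ID of the broken member is needed later for `member remove`, so it can be handy to pull it out of this output mechanically. Here is a rough sketch that relies only on the line format shown above (`member <id> is <status>: ...`); the function name `unhealthy_members` is an assumption for illustration.

```shell
# Read `etcdctl cluster-health` output on stdin and print the ID of
# every member whose status word is anything other than "healthy".
# Lines that do not start with "member <id> is ..." are ignored.
unhealthy_members() {
    awk '$1 == "member" && $3 == "is" && $4 != "healthy:" { print $2 }'
}
```

Used as `/opt/bin/etcdctl cluster-health | unhealthy_members`, this would print `658a31702f200e95` for the output above.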
- Check the logs
[root@cloudos02 ~]# journalctl -xe -u etcd2
(a large amount of output mentioning snap.broken)
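The logs confirm that etcd's files are corrupted, almost certainly because the root partition was full and the data replicated from the other members could not be written
The fix is to delete this node's data directory and let it sync again from the cluster, so the first step is to find where the etcd data directory lives
Since etcd runs here as a regular service on the host, go straight to its systemd unit file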
[root@cloudos02 ~]# cat /usr/lib/systemd/system/etcd2.service
[Unit]
Description=Etcd2 Server
[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/kube-etcd-cluster
ExecStart=/opt/bin/etcd --name=${ETCD_NAME}
...... (omitted)
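/usr/lib/systemd/system/etcd2.service does not set a data directory, so etcd falls back to its default, ${name}.etcd, relative to the process's working directory (for a systemd service that is / unless WorkingDirectory is set)
The name used in etcd2.service comes from a variable defined in /etc/sysconfig/kube-etcd-cluster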
[root@cloudos02 ~]# cat /etc/sysconfig/kube-etcd-cluster
ETCD_NAME="NODE2"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.12.0.22:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.12.0.22:2379"
ETCD_INITIAL_CLUSTER_TOKEN="my-etcd-cluster"
ETCD_INITIAL_CLUSTER="NODE1=http://10.12.0.21:2380,NODE2=http://10.12.0.22:2380,NODE3=http://10.12.0.23:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
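The lookup of the default data directory can be sketched as a small helper. This is only an illustration under a few assumptions: the environment file is plain `KEY="value"` shell syntax (as it is above), the data directory is not set with `--data-dir` in the elided part of the unit file, and the working directory is `/` unless a second argument says otherwise. The function name `etcd_default_data_dir` is hypothetical.

```shell
# etcd_default_data_dir ENV_FILE [WORKDIR]
# Print the data directory etcd would use: ETCD_DATA_DIR if the file
# sets it, otherwise WORKDIR/<ETCD_NAME>.etcd (WORKDIR defaults to /).
etcd_default_data_dir() {
    (
        # Subshell: the sourced variables do not leak into the caller.
        unset ETCD_NAME ETCD_DATA_DIR
        . "$1"
        if [ -n "${ETCD_DATA_DIR:-}" ]; then
            printf '%s\n' "$ETCD_DATA_DIR"
        else
            workdir="${2:-/}"
            # Strip one trailing slash so "/" does not produce "//".
            printf '%s/%s.etcd\n' "${workdir%/}" "$ETCD_NAME"
        fi
    )
}
```

With the file shown above, `etcd_default_data_dir /etc/sysconfig/kube-etcd-cluster` would print `/NODE2.etcd`.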
The root directory does indeed contain NODE2.etcd, so delete that data directory
[root@cloudos02 ~]# ll /
drwx------ 3 root root 4096 Jun 29 11:47 NODE2.etcd
[root@cloudos02 ~]# rm -rf /NODE2.etcd
On one of the healthy nodes, remove this member from the cluster and then add it back
[root@cloudos01 ~]# /opt/bin/etcdctl member remove 658a31702f200e95
Removed member 658a31702f200e95 from cluster
[root@cloudos01 ~]# /opt/bin/etcdctl member add NODE2 http://10.12.0.22:2380
Then, on the broken node, edit the configuration file /etc/sysconfig/kube-etcd-cluster
Change ETCD_INITIAL_CLUSTER_STATE=new to ETCD_INITIAL_CLUSTER_STATE=existing and start etcd
[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#new#existing#' /etc/sysconfig/kube-etcd-cluster
[root@cloudos02 ~]# systemctl start etcd2
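The sed edit swaps one word for another, so running it twice or against an unexpected value does nothing useful. A slightly more defensive variant is sketched below: it sets the whole value, anchors on the exact key, and prints the resulting line for verification. The function name `set_cluster_state` is an assumption for illustration, and GNU sed is assumed for `-i`.

```shell
# set_cluster_state FILE STATE
# STATE must be "new" or "existing". Rewrites the value of
# ETCD_INITIAL_CLUSTER_STATE in FILE and prints the resulting line.
set_cluster_state() {
    case "$2" in
        new|existing) ;;
        *) echo "state must be 'new' or 'existing'" >&2; return 1 ;;
    esac
    # Anchoring on "^KEY=" keeps ETCD_INITIAL_CLUSTER and
    # ETCD_INITIAL_CLUSTER_TOKEN untouched.
    sed -i -E "s#^(ETCD_INITIAL_CLUSTER_STATE=).*#\1\"$2\"#" "$1"
    grep '^ETCD_INITIAL_CLUSTER_STATE=' "$1"
}
```

For this step that would be `set_cluster_state /etc/sysconfig/kube-etcd-cluster existing`, and the same function can set the value back to `new` afterwards.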
Check the status of the cluster members
[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
member f95341f81eb9322c is healthy: got healthy result from http://10.12.0.22:2379
cluster is healthy
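Finally, on the recovered node, edit /etc/sysconfig/kube-etcd-cluster once more
Change ETCD_INITIAL_CLUSTER_STATE=existing back to new. This restores the file to its original content; etcd ignores the initial-cluster settings once a data directory exists, so the change does not affect the running member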
[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#existing#new#' /etc/sysconfig/kube-etcd-cluster
After that, I started the related services and the node was fully back to normal