Notes on a production k8s node failure

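Summary: a service went down once. One node in the cluster was in the NodeLost state with all related services down. The root partition was full because k8s logs had piled up under /var/log/, and this had corrupted the etcd data on that node.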

An email alert from zabbix reported that the business login page was returning a status other than 200. It later recovered on its own.

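So the service had gone down at least once. After logging in to the machines, I found that one node in the cluster was in the NodeLost state.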

On that node, all the related services were down.

Further digging showed that the root partition was full, and the cause was k8s logs piling up under /var/log/.

[root@cloudos02 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  216G     0 100% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0
[root@cloudos02 /var/log/]# du -shx /var/log/* | grep -P '^\S+?G'
31G   /var/log/heat
1.1G  /var/log/keystone
163G  /var/log/kubernetes
3.7G  /var/log/nova
1.7G  /var/log/openstack-compute
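
The grep above only catches entries whose size is reported in G. A minimal sketch of a more general ranking, assuming GNU coreutils are available; the function name `largest_entries` is hypothetical and not part of the original procedure:

```shell
#!/usr/bin/env bash
# List the N largest entries directly under a directory, biggest first.
# largest_entries is an illustrative name; defaults are /var/log and 5.
largest_entries() {
    local dir="${1:-/var/log}" count="${2:-5}"
    # -x stays on one filesystem, -s summarizes each entry, -k reports KiB
    du -xsk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

Calling `largest_entries /var/log 5` prints the size in KiB and the path of each entry, which makes a directory like /var/log/kubernetes stand out regardless of the unit.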

The log file names follow a regular pattern, so I simply deleted the log files older than 20 days:

[root@cloudos02 kubernetes]# find -mtime +20 -name 'kube*.cloudos02*' -exec rm -f {} \;
[root@cloudos02 kubernetes]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  219G  117G   91G  57% /
devtmpfs                  63G     0   63G   0% /dev
tmpfs                     63G   12K   63G   1% /dev/shm
tmpfs                     63G  226M   63G   1% /run
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
/dev/sda3                197M  136M   61M  70% /boot
/dev/sda2                200M     0  200M   0% /boot/efi
tmpfs                     13G     0   13G   0% /run/user/0
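
Because `-exec rm -f` cannot be undone, it can help to preview the match list before deleting. A sketch assuming GNU find; `prune_old_logs` and the `DRY_RUN` variable are hypothetical names introduced here for illustration:

```shell
#!/usr/bin/env bash
# Remove regular files older than DAYS days whose names match PATTERN.
# With DRY_RUN=1 (the default in this sketch) the files are only listed.
prune_old_logs() {
    local dir="$1" days="$2" pattern="$3"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        find "$dir" -type f -mtime +"$days" -name "$pattern" -print
    else
        find "$dir" -type f -mtime +"$days" -name "$pattern" -delete
    fi
}
```

For example, `prune_old_logs /var/log/kubernetes 20 'kube*.cloudos02*'` lists the candidates, and prefixing the same call with `DRY_RUN=0` deletes them. Note that a deleted file that a running process still holds open does not free its space until that process closes it.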

etcd is at the core of k8s, and sure enough etcd was the problem, which is why every k8s-related service on the node was down:

[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
failed to check the health of member 658a31702f200e95 on http://10.12.0.22:2379: Get http://10.12.0.22:2379/health: dial tcp 10.12.0.22:2379: getsockopt: connection refused
member 658a31702f200e95 is unreachable: [http://10.12.0.22:2379] are all unreachable
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
cluster is healthy
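
In the output above, the last line still reads `cluster is healthy` even though one member is unreachable, so the per-member lines are what matter. A sketch that filters them, assuming the output format shown above; `unhealthy_members` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Read `etcdctl cluster-health` output on stdin and print the ID of every
# member line that does not say "is healthy:". Exits non-zero if any is found.
unhealthy_members() {
    awk '$1 == "member" && !($3 == "is" && $4 == "healthy:") { print $2; bad = 1 }
         END { exit bad + 0 }'
}
```

Usage would be `/opt/bin/etcdctl cluster-health | unhealthy_members`, which is convenient in a monitoring check because the exit status reflects whether any member needs attention.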

Check the logs:

[root@cloudos02 ~]# journalctl -xe -u etcd2
(a large amount of output mentioning snap.broken)

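The logs confirm that etcd's files are corrupted, almost certainly because the full root partition meant the data replicated from the other members could not be written.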

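The first step is to find etcd's data directory. The fix is to delete the data directory on this node and let it sync again from the other members.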

Since etcd runs as a service directly on the host, go straight to its systemd unit file:

[root@cloudos02 ~]# cat /usr/lib/systemd/system/etcd2.service 
[Unit]
Description=Etcd2 Server

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/kube-etcd-cluster
ExecStart=/opt/bin/etcd --name=${ETCD_NAME} ...... (omitted)

/usr/lib/systemd/system/etcd2.service does not set a data directory, so etcd falls back to its default, ${name}.etcd, relative to the directory the process starts in.

The name used in etcd2.service is read from a variable in /etc/sysconfig/kube-etcd-cluster:

[root@cloudos02 ~]# cat /etc/sysconfig/kube-etcd-cluster
ETCD_NAME="NODE2"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_PEER_URLS="http://10.12.0.22:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.12.0.22:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.12.0.22:2379"
ETCD_INITIAL_CLUSTER_TOKEN="my-etcd-cluster"
ETCD_INITIAL_CLUSTER="NODE1=http://10.12.0.21:2380,NODE2=http://10.12.0.22:2380,NODE3=http://10.12.0.23:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
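
The lookup described above can be sketched as follows. It assumes the env file contains plain shell-compatible `KEY="value"` lines like the ones shown, and that the unit sets no `WorkingDirectory=`, in which case a systemd system service starts in `/`. The helper name `default_etcd_data_dir` is hypothetical:

```shell
#!/usr/bin/env bash
# Print the data directory etcd would use for a given env file.
# Second argument is the service's working directory (assumed "/" by default).
default_etcd_data_dir() {
    local env_file="$1" workdir="${2:-/}"
    (
        # Subshell: sourced variables do not leak into the caller.
        unset ETCD_NAME ETCD_DATA_DIR
        . "$env_file"
        if [ -n "${ETCD_DATA_DIR:-}" ]; then
            printf '%s\n' "$ETCD_DATA_DIR"        # explicit setting wins
        else
            # etcd's default member name is "default" when none is given
            printf '%s/%s.etcd\n' "${workdir%/}" "${ETCD_NAME:-default}"
        fi
    )
}
```

With the file above, `default_etcd_data_dir /etc/sysconfig/kube-etcd-cluster` would print `/NODE2.etcd`.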

The root directory does contain NODE2.etcd, so delete that data directory:

[root@cloudos02 ~]# ll /
drwx------    3 root root  4096 Jun 29 11:47 NODE2.etcd
[root@cloudos02 ~]# rm -rf /NODE2.etcd

On one of the healthy nodes, remove this member from the cluster and then add it back:

[root@cloudos01 ~]# /opt/bin/etcdctl member remove 658a31702f200e95
Removed member 658a31702f200e95 from cluster
[root@cloudos01 ~]# /opt/bin/etcdctl member add NODE2 http://10.12.0.22:2380

Then, on the failed node, edit the config file /etc/sysconfig/kube-etcd-cluster.

Change ETCD_INITIAL_CLUSTER_STATE=new to ETCD_INITIAL_CLUSTER_STATE=existing and start etcd:

[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#new#existing#' /etc/sysconfig/kube-etcd-cluster
[root@cloudos02 ~]# systemctl start etcd2

Check the cluster member status:

[root@cloudos02 ~]# /opt/bin/etcdctl cluster-health
member 9bd4565552fd93c is healthy: got healthy result from http://10.12.0.21:2379
member d1a9f9229366f9b8 is healthy: got healthy result from http://10.12.0.23:2379
member f95341f81eb9322c is healthy: got healthy result from http://10.12.0.22:2379
cluster is healthy

Then, on the failed node, edit /etc/sysconfig/kube-etcd-cluster once more.

Change ETCD_INITIAL_CLUSTER_STATE=existing back to new:

[root@cloudos02 ~]# sed -ri '/ETCD_INITIAL_CLUSTER_STATE/s#existing#new#' /etc/sysconfig/kube-etcd-cluster
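
The two sed edits in this procedure replace the first occurrence of one word on the matching line. A sketch of a variant that sets the value regardless of its current content and keeps a backup, assuming GNU sed and the quoted `KEY="value"` format shown earlier; `set_cluster_state` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Set ETCD_INITIAL_CLUSTER_STATE in an env file to "new" or "existing".
# A copy of the previous file is kept next to it with a .bak suffix.
set_cluster_state() {
    local file="$1" state="$2"
    case "$state" in
        new|existing) ;;
        *) echo "state must be 'new' or 'existing'" >&2; return 1 ;;
    esac
    sed -r -i.bak "s#^(ETCD_INITIAL_CLUSTER_STATE=).*#\1\"$state\"#" "$file"
}
```

For example, `set_cluster_state /etc/sysconfig/kube-etcd-cluster existing` before starting etcd2, and the same call with `new` afterwards.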

After that, I started the related services and the node was fully back to normal.
