How to recover etcd data files corrupted by a power outage? - V2EX

How to recover etcd data files corrupted by a power outage?

seagull7558 · 2022-04-22 10:30:50 +08:00 · 3307 views

    This topic was created 1269 days ago; the information in it may have changed since then.

    We were notified in advance that power would be cut and the machines restarted, but we'd had outages before and kubelet, docker, and the services on top of them usually came back up on their own, so I didn't pay attention and didn't shut the cluster down beforehand. To my complete surprise, after powering back on today I found the Master node was down.

    Troubleshooting

    1. First, log in to the master node and check the status of each node:

    [root@k8s-master wal]# kubectl get nodes
    The connection to the server 192.168.113.152:6443 was refused - did you specify the right host or port?

    2. So kubelet seems to have a problem; check its status:

    [root@k8s-master wal]# systemctl status kubelet
    ● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
      Drop-In: /usr/lib/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: active (running) since Fri 2022-04-22 09:18:34 CST; 1h 4min ago
         Docs: https://kubernetes.io/docs/
     Main PID: 1017 (kubelet)
        Tasks: 17 (limit: 75136)
       Memory: 152.0M
       CGroup: /system.slice/kubelet.service
               └─1017 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-c>

    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.289217 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    (the same "Error getting node" line repeats roughly every 100 ms)
    4 月 22 10:22:56 k8s-master kubelet[1017]: E0421 22:22:56.196740 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    lines 1-22/22 (END)

    3. Getting the nodes failed, so I looked through the control-plane component containers' logs and found that the apiserver had a problem. Its container log:

    2022-04-22T02:21:31.203199623Z I0422 02:21:31.202994 1 server.go:565] external host was not specified, using 192.168.113.152
    2022-04-22T02:21:31.231588697Z I0422 02:21:31.231374 1 server.go:172] Version: v1.23.5
    2022-04-22T02:21:31.389807666Z I0422 02:21:31.389666 1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
    2022-04-22T02:21:31.390038398Z I0422 02:21:31.389913 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
    2022-04-22T02:21:31.390047004Z I0422 02:21:31.389934 1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
    2022-04-22T02:21:31.392738574Z W0422 02:21:31.392578 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    (the same "connection refused" warning repeats with backoff for the next 20 seconds)
    2022-04-22T02:21:51.393624485Z E0422 02:21:51.393263 1 run.go:74] "command failed" err="context deadline exceeded"

    4. Port 2379 points further down the stack: the apiserver clearly cannot connect to etcd. At this point I was seriously panicking. etcd stores a great deal of the k8s configuration; if it can't be repaired, a lot of services will have to be rebuilt. The etcd container log:

    2022-04-22T02:26:23.131382248Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.113.152:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--initial-advertise-peer-urls=https://192.168.113.152:2380","--initial-cluster=k8s-master=https://192.168.113.152:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.113.152:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.113.152:2380","--name=k8s-master","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
    2022-04-22T02:26:23.131412789Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
    2022-04-22T02:26:23.131416304Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.113.152:2380"]}
    2022-04-22T02:26:23.131418664Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:478","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
    2022-04-22T02:26:23.131421030Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.113.152:2379"]}
    2022-04-22T02:26:23.131423955Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.1","git-sha":"e8732fb5f","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"k8s-master","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.113.152:2380"],"listen-peer-urls":["https://192.168.113.152:2380"],"advertise-client-urls":["https://192.168.113.152:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.113.152:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
    2022-04-22T02:26:23.133621480Z panic: freepages: failed to get all reachable pages (page 478: multiple references)
    2022-04-22T02:26:23.133637664Z goroutine 120 [running]:
    2022-04-22T02:26:23.133639420Z go.etcd.io/bbolt.(*DB).freepages.func2(0xc00047e540)
    2022-04-22T02:26:23.133648190Z         /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
    2022-04-22T02:26:23.133650552Z created by go.etcd.io/bbolt.(*DB).freepages
    2022-04-22T02:26:23.133652184Z         /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd

    Restarting etcd didn't help either. A round of searching turned up that running inside VMware is known to cause this kind of problem and can leave the etcd data file corrupted after a power loss. Done for!
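The panic above comes from bbolt, the embedded key-value store underneath etcd, failing its free-page consistency check while opening the database file. To confirm the corruption independently of etcd, the `bbolt` command-line tool that ships with the library can check the file directly. A minimal sketch, assuming the kubeadm default data dir from the logs and a Go toolchain available to install the tool:

```shell
# Install the bbolt CLI (part of go.etcd.io/bbolt).
go install go.etcd.io/bbolt/cmd/bbolt@latest

# etcd must be stopped first: bbolt takes an exclusive lock on the file.
# On a healthy database this prints "OK"; on a corrupted one it reports
# page-level errors consistent with the "multiple references" panic.
bbolt check /var/lib/etcd/member/snap/db
```

This only diagnoses; it does not repair. But it is a quick way to tell whether the db file itself is damaged or the panic has some other cause.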

    Related issues

    9 replies · last reply 2022-07-17 09:10:31 +08:00
    #1  rrfeng · 2022-04-22 10:41:43 +08:00 via Android

    Restore from an earlier snapshot.

    You also need to take regular offline backups of this thing…
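For reference, the kind of regular offline backup suggested here is typically a cron job around `etcdctl snapshot save` against the running cluster. A sketch assuming the kubeadm certificate paths visible in the logs above; the backup directory is a made-up example, and copies should also be moved off the node:

```shell
# Take a point-in-time snapshot of a *running* etcd.
# Cert paths are the kubeadm defaults; /var/backups is illustrative.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save "/var/backups/etcd-$(date +%F).db"
```

A snapshot taken this way goes through etcd itself, so it is consistent even while writes are happening — unlike copying /var/lib/etcd from a live member.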
    #2  seagull7558 (OP) · 2022-04-22 10:43:40 +08:00

    @rrfeng Do you mean deleting the newest .snap file under the /var/lib/etcd/member/snap directory?
    #3  FullBridgeRect · 2022-04-22 10:57:26 +08:00

    Is your etcd set up for high availability? As long as you're within the fault-tolerance limit, you can just bring up a new node and rebuild, right?
    #4  seagull7558 (OP) · 2022-04-22 11:05:26 +08:00

    @FullBridgeRect No HA. It's only a dev environment, set up to simulate our configuration. I admit I got lazy and did neither HA nor backups.
    #5  FullBridgeRect · 2022-04-22 11:06:34 +08:00

    @seagull7558 Then it's gone.
    #6  disk · 2022-04-22 11:14:54 +08:00

    Back everything up first, then delete the *.wal or *.snap files and see whether it starts. If not, see whether you can restore from the latest .snap. If that fails too, it's not worth trying to recover any further; the cost is too high.
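Whatever you try from here, the "back everything up first" step deserves emphasis: copy the whole data directory aside before touching any .wal/.snap files, so every experiment can be rolled back. A minimal sketch (the helper name is made up):

```shell
# backup_etcd_dir: copy an etcd data directory aside with a timestamp
# suffix, so .wal/.snap surgery can be undone. Run with etcd stopped.
backup_etcd_dir() {
    src="$1"
    bak="${src}.bak-$(date +%Y%m%d%H%M%S)"
    cp -a "$src" "$bak" && echo "$bak"
}

# Typical use on the broken master (data dir from the etcd logs above):
# backup_etcd_dir /var/lib/etcd
```

`cp -a` preserves ownership and permissions, which matters when the static pod later reads the directory back.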
    #7  seagull7558 (OP) · 2022-04-22 11:29:39 +08:00

    @disk I tried deleting the wal and snap directories outright; you lose all of the K8S cluster's configuration that way.
    #8  zhaoyeye · 2022-04-23 00:07:41 +08:00 via Android

    @seagull7558 I restored from a snapshot just a couple of days ago. Either way, there are quite a few pitfalls.
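For anyone landing here later: restoring from a snapshot means rebuilding a fresh data directory from the snapshot file and pointing etcd at it, not editing files in place. A sketch using the member name and peer URL from this cluster's logs; the snapshot path and restored-dir name are made-up examples, and on etcd 3.5 the same subcommand is also offered as `etcdutl snapshot restore`:

```shell
# Rebuild a data dir from a snapshot taken earlier with `snapshot save`.
# Stop the etcd/apiserver static pods first.
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-backup.db \
  --name k8s-master \
  --data-dir /var/lib/etcd-restored \
  --initial-cluster k8s-master=https://192.168.113.152:2380 \
  --initial-advertise-peer-urls https://192.168.113.152:2380

# Then swap the restored dir into place and let kubelet restart the pods:
# mv /var/lib/etcd /var/lib/etcd.broken
# mv /var/lib/etcd-restored /var/lib/etcd
```

The restore writes a brand-new member with fresh WAL, which is why it can work even when the original db file panics on open — provided a snapshot exists at all.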
    #9  stevefan1999 · 2022-07-17 09:10:31 +08:00 via Android

    etcd is backed by BoltDB, so in principle this isn't recoverable. If the cluster has other HA masters, copy the intact data over from one of them in full; failing that, look for a backup.