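Troubleshooting K8S Node Failures
1. Introduction
The kubectl command line can be used for an initial diagnosis of a K8S Node failure. The approach works on almost every K8S cluster.
2. Troubleshooting Steps
Use the kubelet dashboard in Grafana to find when the node became NotReady, then list the nodes to confirm the current state: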

# kubectl get nodes
NAME          STATUS     ROLES    AGE   VERSION
172.16.80.4   Ready      <none>   18h   v1.20.8
172.16.80.6   NotReady   <none>   18h   v1.20.8
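On a larger cluster it can help to reduce that listing to the nodes that need attention. The following is a minimal sketch, assuming the default column layout of kubectl get nodes with the header line present; the helper name not_ready_nodes is hypothetical and not part of kubectl.

```shell
# Reads `kubectl get nodes` output on stdin and prints the name of every node
# whose STATUS does not start with "Ready". A cordoned node is shown as
# "Ready,SchedulingDisabled", so only the part before the first comma is compared.
# The first input line is treated as the header and skipped.
not_ready_nodes() {
  awk 'NR > 1 { split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Usage (needs access to a cluster):
#   kubectl get nodes | not_ready_nodes
```

Because the function only reads stdin, it can also be run against a saved copy of the listing when the cluster is no longer reachable.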
Run kubectl describe node 172.16.80.6 to view the abnormal events:
Conditions:
  Type                 Status   LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----                 ------   -----------------                 ------------------                ------              -------
  NetworkUnavailable   False    Mon, 13 Jun 2022 19:41:10 +0800   Mon, 13 Jun 2022 19:41:10 +0800   RouteCreated        CCE RouteController created a route
  MemoryPressure       Unknown  Tue, 14 Jun 2022 14:08:00 +0800   Tue, 14 Jun 2022 14:09:36 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown  Tue, 14 Jun 2022 14:08:00 +0800   Tue, 14 Jun 2022 14:09:36 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown  Tue, 14 Jun 2022 14:08:00 +0800   Tue, 14 Jun 2022 14:09:36 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown  Tue, 14 Jun 2022 14:08:00 +0800   Tue, 14 Jun 2022 14:09:36 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
Run kubectl get node 172.16.80.6 -o yaml to view the NodeConditions:
conditions:
- lastHeartbeatTime: "2022-06-13T11:41:10Z"
  lastTransitionTime: "2022-06-13T11:41:10Z"
  message: CCE RouteController created a route
  reason: RouteCreated
  status: "False"
  type: NetworkUnavailable
- lastHeartbeatTime: "2022-06-14T06:08:00Z"
  lastTransitionTime: "2022-06-14T06:09:36Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: MemoryPressure
- lastHeartbeatTime: "2022-06-14T06:08:00Z"
  lastTransitionTime: "2022-06-14T06:09:36Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: DiskPressure
- lastHeartbeatTime: "2022-06-14T06:08:00Z"
  lastTransitionTime: "2022-06-14T06:09:36Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: PIDPressure
- lastHeartbeatTime: "2022-06-14T06:08:00Z"
  lastTransitionTime: "2022-06-14T06:09:36Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: Ready
Log in to the node and view the kubelet logs:
journalctl -u kubelet --since="2022-06-14 14:00:00" | less
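The describe output prints timestamps in the local zone (+0800 here), the YAML prints them in UTC, and journalctl --since reads its argument as local time on the node. The sketch below converts a YAML timestamp into the form journalctl accepts. It assumes GNU date (the -d option); utc_to_local is a hypothetical helper name.

```shell
# Converts an RFC 3339 UTC timestamp, as printed in the node YAML, into
# "YYYY-MM-DD HH:MM:SS" in the zone selected by the TZ environment variable
# (or the system zone if TZ is unset). Assumes GNU date.
utc_to_local() {
  date -d "$1" '+%Y-%m-%d %H:%M:%S'
}

# Usage (the first command needs kubectl access, the second runs on the node):
#   ts=$(kubectl get node 172.16.80.6 \
#        -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}')
#   journalctl -u kubelet --since="$(utc_to_local "$ts")" | less
```

Starting from lastHeartbeatTime shows what the kubelet logged after its last successful status report; an earlier start, such as the 14:00:00 used above, also captures the lead-up.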
3. Common Problems
Kubelet stopped posting node status
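The kubelet has stopped reporting heartbeats, which usually means the node is down. Ask the user to try logging in to the node; if login is impossible, a reboot generally restores it. The cause is usually related to node load, so check monitoring for the load on the node before the failure.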
PLEG is not healthy
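PLEG stands for Pod Lifecycle Event Generator. The kubelet periodically syncs pod state, and when a sync times out (3 minutes) it sets the node to NotReady.
- Check whether any container inspect call is hanging: docker ps -a -q | xargs docker inspect
If the command hangs, narrow it down to the specific container responsible by running docker inspect {CONTAINER ID} against individual containers. Once the container is identified, and with the customer's permission, remove it with docker rm -f {CONTAINER ID}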
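Running docker inspect by hand against each container is slow when one of the calls blocks. The following is one possible sketch of the same search with a time limit per container. It assumes GNU coreutils timeout is installed; find_stuck_containers is a hypothetical helper name, and the 10-second default is an arbitrary choice.

```shell
# Prints the ID of every container whose `docker inspect` does not return
# within the given number of seconds (default 10). Relies on GNU `timeout`,
# which exits with status 124 when the time limit is reached.
find_stuck_containers() {
  limit="${1:-10}"
  for id in $(docker ps -a -q); do
    rc=0
    timeout "$limit" docker inspect "$id" > /dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then
      echo "$id"
    fi
  done
}

# Usage on the affected node:
#   find_stuck_containers 10
```

Only a timeout is reported; a container that disappears between the listing and the inspect call produces a different exit status and is skipped.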
Node Evicted
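When pods are evicted from a node because the node is short of resources (CPU, memory, disk), handle it according to the cause:
- Insufficient CPU or memory:
  - Upgrade the VM to a larger specification.
  - Set sensible resource requests so that pods are spread appropriately across nodes.
- Insufficient disk space:
  - Expand the disk that holds the container data directory.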
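Before expanding a disk, it is worth confirming which filesystem is actually full. A minimal sketch, assuming POSIX df -P output; disks_over is a hypothetical helper name, and both the /var/lib/docker path and the 85 percent threshold in the usage line are illustrative assumptions.

```shell
# Reads `df -P` output on stdin and prints "<mount point> <use%>" for every
# filesystem whose usage is at or above the given percentage.
# Mount points containing spaces are not handled.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage on the node (substitute the container data directory actually in use):
#   df -P /var/lib/docker | disks_over 85
```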

