K8s Troubleshooting Guide: A Hands-On Walkthrough of Five Common Causes of Pod Startup Failures

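Kubernetes is the de facto standard for container orchestration, and Pods that fail to start are one of the most common problems in day-to-day deployments. Seeing kubectl get pods return ImagePullBackOff or CrashLoopBackOff is enough to make any developer's heart sink. Drawing on real-world cases, this article breaks down five high-frequency failure scenarios and how to fix each one.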

1. Image Pull Failure (ImagePullBackOff)

Typical error: Failed to pull image "private-repo/app:v1.2": unauthorized

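Root causes:

Missing credentials for a private registry
A typo in the image name or tag
Network restrictions (firewall, proxy, or egress rules) that block the node's access to the registry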

Solution:

Create a docker-registry secret:
kubectl create secret docker-registry my-secret \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass
Reference the secret in the Pod spec (imagePullSecrets sits at the spec level, alongside containers):
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1.2
  imagePullSecrets:
  - name: my-secret
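
Before creating the secret, it can help to confirm which registry the failing image actually resolves to, because the value passed to --docker-server has to match it. The following is a minimal shell sketch, assuming the usual Docker reference convention (a first path component with no dot or colon, other than localhost, is treated as a Docker Hub namespace). The function name image_registry is made up for illustration and is not part of kubectl.

```shell
# Hypothetical helper (not part of kubectl): print the registry host that an
# image reference points at, following the common Docker naming convention.
image_registry() {
  ref="$1"
  case "$ref" in
    */*) first="${ref%%/*}" ;;             # text before the first slash
    *)   echo "docker.io"; return ;;       # bare name such as "nginx"
  esac
  case "$first" in
    localhost|*.*|*:*) echo "$first" ;;    # looks like a host or host:port
    *)                 echo "docker.io" ;; # plain namespace: Docker Hub
  esac
}

image_registry "registry.example.com/app:v1.2"   # registry.example.com
image_registry "private-repo/app:v1.2"           # docker.io
```

Under this convention, the image in the error message (private-repo/app:v1.2) and the registry the secret was created for (registry.example.com) are different hosts, and a mismatch of that kind could by itself produce an unauthorized error. Runtimes configured with other default registries may resolve short names differently.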

2. Insufficient Resources (Pending)

Symptom: The Pod is stuck in Pending, and kubectl describe pod shows Insufficient cpu or Insufficient memory.

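Tip: kube-scheduler's resource bin-packing scoring strategies (for example, MostAllocated in the NodeResourcesFit plugin) can raise node utilization, but they score nodes by what Pods request, so requests and limits must be set explicitly: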

resources:
  requests:
    memory: "512Mi"
    cpu: "0.5"
  limits:
    memory: "1Gi" 
    cpu: "1"
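
CPU quantities can be written either in cores or in millicores, so "0.5" and "500m" describe the same request. The sketch below normalizes both spellings so a request can be compared against a limit. The function name cpu_to_millicores is an illustrative assumption, and it only handles plain decimals and the m suffix.

```shell
# Hypothetical helper: convert a Kubernetes CPU quantity to millicores.
# Only plain decimals ("0.5", "1") and the "m" suffix ("500m") are handled.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;                                   # already millicores
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

request=$(cpu_to_millicores "0.5")
limit=$(cpu_to_millicores "1")
if [ "$request" -le "$limit" ]; then
  echo "request ${request}m fits within limit ${limit}m"
fi
```

To see how much of a node is already spoken for, the "Allocated resources" section of kubectl describe node <node-name> lists the summed requests and limits of the Pods running there.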

3. Health Checks Killing Healthy Pods (CrashLoopBackOff)

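Classic case: A Spring Boot application is restarted by the kubelet because its startup takes longer than the 30 seconds the liveness probe allows.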

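Fixes:

Increase the liveness probe's initial delay:
livenessProbe:
  initialDelaySeconds: 45
Use an exec probe instead of an HTTP probe (note that this only confirms a Java process exists, not that the application is serving requests):
exec:
  command: ["pgrep", "java"]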
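
The 30-second figure in the case above is consistent with the liveness probe defaults: periodSeconds is 10 and failureThreshold is 3, so three consecutive failed checks take roughly 30 seconds. The sketch below reproduces that arithmetic. The function name probe_grace_seconds is an illustrative assumption, and the result is a rough estimate rather than the kubelet's exact timing.

```shell
# Hypothetical helper: estimate how long a container has before the kubelet
# restarts it if the liveness probe never succeeds.
# Arguments (all optional): initialDelaySeconds periodSeconds failureThreshold
probe_grace_seconds() {
  initial="${1:-0}"     # initialDelaySeconds, Kubernetes default 0
  period="${2:-10}"     # periodSeconds, Kubernetes default 10
  threshold="${3:-3}"   # failureThreshold, Kubernetes default 3
  echo $(( initial + period * threshold ))
}

probe_grace_seconds        # 30
probe_grace_seconds 45     # 75
```

With initialDelaySeconds set to 45, the estimated budget grows from about 30 to about 75 seconds, which is why the first fix gives a slow-starting application room to come up.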

4. Persistent Volume Mount Problems (CreateContainerError)

Error: MountVolume.SetUp failed for volume "pvc-db" : mount failed: exit status 32

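Troubleshooting steps:

Check whether the PersistentVolumeClaim (PVC) is bound: kubectl get pvc
Verify that the StorageClass exists: kubectl get storageclass
Confirm that the node has the required filesystem tools installed (for example, the NFS client)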
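
The first check above can be narrowed to just the claims that need attention. This is a small sketch, assuming the default column layout of kubectl get pvc (NAME first, STATUS second). The function name unbound_pvcs and the sample rows are made up for illustration.

```shell
# Hypothetical filter: read "kubectl get pvc --no-headers" style rows on stdin
# and print only the claims whose STATUS column is not "Bound".
unbound_pvcs() {
  awk 'NF >= 2 && $2 != "Bound" { print $1 ": " $2 }'
}

# Made-up sample rows standing in for real cluster output.
printf '%s\n' \
  'pvc-db     Pending                         standard   5m' \
  'pvc-cache  Bound    pv-1   1Gi   RWO       standard   3d' \
  | unbound_pvcs
# prints: pvc-db: Pending
```

Against a real cluster the same filter would be fed with kubectl get pvc --no-headers | unbound_pvcs, and kubectl describe pvc <pvc-name> then shows the events explaining why a claim is not bound.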

5. Node Taints Preventing Scheduling

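Symptom: The Pod is never assigned to a node and stays Pending with no obvious error; the reason only shows up as a FailedScheduling event in kubectl describe pod.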

Quick diagnosis:

kubectl describe node <node-name> | grep Taint
# Example output: Taints: dedicated=special:NoSchedule

Solution: add a toleration to the PodSpec:

tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "special"
  effect: "NoSchedule"
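
The mapping from the taint string in the diagnosis output to the toleration above is mechanical: the text before the equals sign is the key, the text between the equals sign and the colon is the value, and the text after the colon is the effect. A minimal sketch of that mapping follows. The function name taint_to_toleration is an illustrative assumption, and taints without a value are given the Exists operator.

```shell
# Hypothetical helper: turn a taint string such as
# "dedicated=special:NoSchedule" into a matching toleration entry.
taint_to_toleration() {
  taint="$1"
  effect="${taint##*:}"   # text after the last colon
  kv="${taint%:*}"        # "key" or "key=value"
  key="${kv%%=*}"
  printf '%s\n' "- key: \"$key\""
  case "$kv" in
    *=*) printf '%s\n' "  operator: \"Equal\"" "  value: \"${kv#*=}\"" ;;
    *)   printf '%s\n' "  operator: \"Exists\"" ;;   # taint has no value
  esac
  printf '%s\n' "  effect: \"$effect\""
}

taint_to_toleration "dedicated=special:NoSchedule"
```

For the example taint this prints the same four lines as the toleration shown above. A toleration only allows the Pod onto the tainted node; it does not force the Pod to land there.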

Conclusion: Build a Systematic Troubleshooting Workflow

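When a Pod fails to start, work through these steps:
1. kubectl describe pod <pod-name> to review the Events section
2. kubectl logs <pod-name> --previous to retrieve the logs from the previously crashed container
3. Run Popeye to scan the cluster's health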
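
As a quick triage aid, the five scenarios in this article can be folded into a lookup from Pod status to the first thing to check. This is only a sketch. The function name next_step and the wording of each hint are illustrative assumptions, and a status can have causes beyond the ones covered here.

```shell
# Hypothetical triage helper: map a Pod STATUS value to the first check
# suggested by the scenarios in this article.
next_step() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "check the image name, tag, and imagePullSecrets" ;;
    Pending)
      echo "check resource requests and node taints" ;;
    CrashLoopBackOff)
      echo "check probe settings and the previous container logs" ;;
    ContainerCreating|CreateContainerError)
      echo "check PVC binding, StorageClass, and node mount tools" ;;
    *)
      echo "start with kubectl describe pod and read the Events" ;;
  esac
}

next_step "CrashLoopBackOff"
```

The status itself is the third column of kubectl get pod <pod-name> --no-headers, so the two can be combined in a wrapper script if that is useful.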
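Kubernetes audit logging, available in the latest release at the time of writing (1.28), records the chain of API requests and is worth enabling in production. Remember: 90% of startup problems are hiding in the output of the describe command!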