2 篇博文含有标签「HPA」

Horizontal Pod Autoscaler 水平自动伸缩

查看所有标签

Kubernetes 全景解析 (6)：生产级微服务架构实战

2026年4月9日 · 阅读需 38 分钟

Rainy

雨落无声，代码成诗 —— 致力于技术与艺术的极致平衡

"纸上得来终觉浅，绝知此事要躬行。"

在前面的系列文章中，我们系统学习了 K8s 的架构设计、工作负载管理、网络模型、存储体系与配置管理。现在是时候将这些知识串联起来，完成一次从零到生产的完整实战演练。

本文将以一个电商微服务系统为蓝本，手把手带你完成以下全流程：

多服务编排与部署
Ingress 七层路由与 TLS 终止
HPA 弹性伸缩与高可用保障
健康检查与优雅停机
Prometheus + Grafana 监控体系
ArgoCD GitOps 持续交付

所有 YAML 配置均基于 Kubernetes v1.34（Of Wind & Will）官方 API 验证，可直接应用于你的集群。

API 版本说明

本文所有 YAML 配置使用的 API 版本均经过 Kubernetes v1.34 官方文档校验：

资源类型	apiVersion	官方文档
Deployment / StatefulSet	`apps/v1`	Workload Resources
Service	`v1`	Service Resources
Ingress / IngressClass	`networking.k8s.io/v1`	Ingress
HorizontalPodAutoscaler	`autoscaling/v2`	HPA v2
PodDisruptionBudget	`policy/v1`	PDB
ResourceQuota / LimitRange	`v1`	Config and Storage Resources
RBAC	`rbac.authorization.k8s.io/v1`	Authorization Resources
ServiceMonitor	`monitoring.coreos.com/v1`	Prometheus Operator
ArgoCD Application	`argoproj.io/v1alpha1`	ArgoCD CRDs
StorageClass	`storage.k8s.io/v1`	Storage Resources

一、实战场景概述

1.1 电商微服务架构设计

我们以一个典型的电商系统为例，将其拆分为以下五个核心微服务：

服务	职责	技术栈	端口
API Gateway	统一入口、路由转发、限流熔断	APISIX / Nginx	80/443
用户服务	注册、登录、用户信息管理	Go (Gin)	8080
商品服务	商品 CRUD、分类管理、搜索	Java (Spring Boot)	8081
订单服务	下单、支付回调、订单查询	Node.js (Express)	8082
支付服务	支付对接、退款、对账	Go (Gin)	8083

底层依赖两个有状态服务：

服务	职责	端口
PostgreSQL	关系型数据库（用户、订单）	5432
Redis	缓存、会话管理、分布式锁	6379

1.2 微服务架构拓扑

1.3 技术栈选择

层级	技术选型	选型理由
网关层	APISIX	高性能、支持 gRPC 转发、插件生态丰富
服务层	Go + Java + Node.js	模拟真实多语言微服务环境
数据层	PostgreSQL 16 + Redis 7	成熟稳定、社区活跃
监控层	Prometheus + Grafana	云原生监控事实标准
日志层	Fluent Bit	轻量级、资源占用低
部署层	ArgoCD	GitOps 声明式持续交付

二、基础设施准备

2.1 命名空间与资源配额

生产环境中，不同团队的服务应该隔离在不同的命名空间中，并通过 ResourceQuota 限制资源使用。

k8s/00-namespace/namespace.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: microservices
  labels:
    app.kubernetes.io/part-of: ecommerce
    app.kubernetes.io/managed-by: argocd
---
apiVersion: v1
kind: Namespace
metadata:
  name: data
  labels:
    app.kubernetes.io/part-of: ecommerce
    app.kubernetes.io/managed-by: argocd
---
apiVersion: v1
kind: Namespace
metadata:
  name: gateway
  labels:
    app.kubernetes.io/part-of: ecommerce
    app.kubernetes.io/managed-by: argocd

k8s/00-namespace/resource-quota.yaml
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: microservices-quota
  namespace: microservices
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "20"
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: microservices
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "2Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
      type: Container

生产环境注意事项

ResourceQuota 的值应根据集群总容量和业务优先级进行合理分配。建议预留 20% 的资源缓冲，避免某个命名空间的突发流量影响其他业务。LimitRange 确保即使开发者忘记设置资源限制，容器也不会无限制地消耗节点资源。

2.2 ConfigMap 与 Secret 配置管理

将配置从镜像中分离出来，是 12-Factor App 的核心原则之一。

k8s/01-config/configmap.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: microservices
data:
  # 数据库连接配置
  DB_HOST: "postgresql.data.svc.cluster.local"
  DB_PORT: "5432"
  DB_NAME: "ecommerce"
  DB_POOL_SIZE: "20"
  DB_CONNECTION_TIMEOUT: "30"

  # Redis 连接配置
  REDIS_HOST: "redis.data.svc.cluster.local"
  REDIS_PORT: "6379"
  REDIS_DB: "0"
  REDIS_POOL_SIZE: "50"

  # 日志配置
  LOG_LEVEL: "info"
  LOG_FORMAT: "json"

  # 服务间调用超时
  SERVICE_TIMEOUT: "10s"
  SERVICE_RETRY: "3"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gateway-config
  namespace: gateway
data:
  # 网关路由配置
  UPSTREAM_USER: "user-service.microservices.svc.cluster.local:8080"
  UPSTREAM_PRODUCT: "product-service.microservices.svc.cluster.local:8081"
  UPSTREAM_ORDER: "order-service.microservices.svc.cluster.local:8082"
  UPSTREAM_PAYMENT: "payment-service.microservices.svc.cluster.local:8083"

k8s/01-config/secret.yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: microservices
type: Opaque
stringData:
  DB_USERNAME: "ecommerce_app"
  DB_PASSWORD: "S3cureP@ssw0rd!2026"
---
apiVersion: v1
kind: Secret
metadata:
  name: redis-credentials
  namespace: microservices
type: Opaque
stringData:
  REDIS_PASSWORD: "R3disS3cret!2026"
---
apiVersion: v1
kind: Secret
metadata:
  name: tls-secret
  namespace: gateway
type: kubernetes.io/tls
stringData:
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    # 此处替换为你的 TLS 证书
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    # 此处替换为你的 TLS 私钥
    -----END PRIVATE KEY-----

安全警告

切勿将 Secret 明文提交到 Git 仓库！ 生产环境应使用以下方案之一：

Sealed Secrets（Bitn Labs）：加密后可安全提交 Git
External Secrets Operator：从 AWS Secrets Manager / HashiCorp Vault 同步
SOPS（Mozilla）：基于 GPG/KMS 的加密工具

本文示例中的 stringData 仅用于演示，生产环境请务必使用加密方案。

2.3 PV/PVC 持久化存储规划

k8s/02-storage/storageclass.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/aws-ebs  # 根据云厂商调整
parameters:
  type: gp3
  fsType: ext4
  iopsPerGB: "50"
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-hdd
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
  fsType: ext4
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate

k8s/02-storage/pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgresql-data
  namespace: data
spec:
  storageClassName: fast-ssd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
  namespace: data
spec:
  storageClassName: fast-ssd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

三、核心服务部署

3.1 API Gateway 部署

网关是整个系统的统一入口，负责路由转发、限流、熔断和认证。

k8s/03-services/gateway-deployment.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: gateway
  labels:
    app: api-gateway
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api-gateway
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api-gateway
                topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 60
      containers:
        - name: apisix
          image: apache/apisix:3.9.0-debian
          ports:
            - name: http
              containerPort: 9080
              protocol: TCP
            - name: https
              containerPort: 9443
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          envFrom:
            - configMapRef:
                name: gateway-config
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9090
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
---
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: gateway
  labels:
    app: api-gateway
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 9080
      protocol: TCP
    - name: https
      port: 443
      targetPort: 9443
      protocol: TCP
  selector:
    app: api-gateway

3.2 用户服务部署

k8s/03-services/user-service.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
  namespace: microservices
  labels:
    app: user-service
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: user-service
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: user-service
                topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 30
      containers:
        - name: user-service
          image: registry.example.com/ecommerce/user-service:v1.0.0
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: grpc
              containerPort: 9090
              protocol: TCP
          env:
            - name: SERVICE_NAME
              value: "user-service"
            - name: SERVICE_PORT
              value: "8080"
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_HOST
            - name: DB_PORT
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_PORT
            - name: DB_NAME
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_NAME
            - name: DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_USERNAME
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_PASSWORD
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: REDIS_HOST
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-credentials
                  key: REDIS_PASSWORD
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
---
apiVersion: v1
kind: Service
metadata:
  name: user-service
  namespace: microservices
  labels:
    app: user-service
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: grpc
      port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: user-service

3.3 订单服务部署

k8s/03-services/order-service.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: microservices
  labels:
    app: order-service
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: order-service
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8082"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: order-service
                topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 60
      containers:
        - name: order-service
          image: registry.example.com/ecommerce/order-service:v1.0.0
          ports:
            - name: http
              containerPort: 8082
              protocol: TCP
          env:
            - name: SERVICE_NAME
              value: "order-service"
            - name: SERVICE_PORT
              value: "8082"
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_HOST
            - name: DB_PORT
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_PORT
            - name: DB_NAME
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_NAME
            - name: DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_USERNAME
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_PASSWORD
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: REDIS_HOST
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-credentials
                  key: REDIS_PASSWORD
            - name: USER_SERVICE_URL
              value: "http://user-service.microservices.svc.cluster.local:8080"
            - name: PRODUCT_SERVICE_URL
              value: "http://product-service.microservices.svc.cluster.local:8081"
            - name: PAYMENT_SERVICE_URL
              value: "http://payment-service.microservices.svc.cluster.local:8083"
          resources:
            requests:
              cpu: 300m
              memory: 384Mi
            limits:
              cpu: "1"
              memory: 768Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8082
            initialDelaySeconds: 20
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8082
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: microservices
  labels:
    app: order-service
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 8082
      targetPort: 8082
      protocol: TCP
  selector:
    app: order-service

3.4 商品服务部署

k8s/03-services/product-service.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-service
  namespace: microservices
  labels:
    app: product-service
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: product-service
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: product-service
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8081"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: product-service
                topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 30
      containers:
        - name: product-service
          image: registry.example.com/ecommerce/product-service:v1.0.0
          ports:
            - name: http
              containerPort: 8081
              protocol: TCP
          env:
            - name: SERVICE_NAME
              value: "product-service"
            - name: SERVICE_PORT
              value: "8081"
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_HOST
            - name: DB_PORT
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_PORT
            - name: DB_NAME
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_NAME
            - name: DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_USERNAME
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_PASSWORD
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: REDIS_HOST
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-credentials
                  key: REDIS_PASSWORD
          resources:
            requests:
              cpu: 300m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8081
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
---
apiVersion: v1
kind: Service
metadata:
  name: product-service
  namespace: microservices
  labels:
    app: product-service
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 8081
      targetPort: 8081
      protocol: TCP
  selector:
    app: product-service

3.5 支付服务部署

k8s/03-services/payment-service.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: microservices
  labels:
    app: payment-service
    version: v1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-service
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: payment-service
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8083"
        prometheus.io/path: "/metrics"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: payment-service
                topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 60
      containers:
        - name: payment-service
          image: registry.example.com/ecommerce/payment-service:v1.0.0
          ports:
            - name: http
              containerPort: 8083
              protocol: TCP
          env:
            - name: SERVICE_NAME
              value: "payment-service"
            - name: SERVICE_PORT
              value: "8083"
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_HOST
            - name: DB_PORT
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_PORT
            - name: DB_NAME
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: DB_NAME
            - name: DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_USERNAME
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_PASSWORD
            - name: PAYMENT_GATEWAY_KEY
              valueFrom:
                secretKeyRef:
                  name: payment-secret
                  key: GATEWAY_API_KEY
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8083
            initialDelaySeconds: 15
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8083
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: microservices
  labels:
    app: payment-service
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 8083
      targetPort: 8083
      protocol: TCP
  selector:
    app: payment-service

3.6 数据库部署（StatefulSet）

有状态服务使用 StatefulSet 部署，确保稳定的网络标识和持久化存储。

k8s/03-services/postgresql-statefulset.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: data
  labels:
    app: postgresql
spec:
  serviceName: postgresql-headless
  replicas: 1
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9187"
        prometheus.io/path: "/metrics"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: postgresql
          image: postgres:16-alpine
          ports:
            - name: postgresql
              containerPort: 5432
              protocol: TCP
          env:
            - name: POSTGRES_DB
              value: "ecommerce"
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_USERNAME
                  optional: false
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: DB_PASSWORD
                  optional: false
            - name: PGDATA
              value: "/var/lib/postgresql/data/pgdata"
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 4Gi
          volumeMounts:
            - name: postgresql-data
              mountPath: /var/lib/postgresql/data
          livenessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - ecommerce_app
                - -d
                - ecommerce
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - ecommerce_app
                - -d
                - ecommerce
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "pg_ctl stop -m fast"]
        - name: postgres-exporter
          image: prometheuscommunity/postgres-exporter:v0.15.0
          ports:
            - name: metrics
              containerPort: 9187
              protocol: TCP
          env:
            - name: DATA_SOURCE_NAME
              value: "postgresql://ecommerce_app:S3cureP@ssw0rd!2026@localhost:5432/ecommerce?sslmode=disable"
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 128Mi
  volumeClaimTemplates:
    - metadata:
        name: postgresql-data
      spec:
        storageClassName: fast-ssd
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql-headless
  namespace: data
  labels:
    app: postgresql
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: postgresql
      port: 5432
      targetPort: 5432
      protocol: TCP
  selector:
    app: postgresql
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: data
  labels:
    app: postgresql
spec:
  type: ClusterIP
  ports:
    - name: postgresql
      port: 5432
      targetPort: 5432
      protocol: TCP
  selector:
    app: postgresql

k8s/03-services/redis-statefulset.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: data
  labels:
    app: redis
spec:
  serviceName: redis-headless
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
        prometheus.io/path: "/metrics"
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
            - --requirepass
            - $(REDIS_PASSWORD)
            - --maxmemory
            - 1gb
            - --maxmemory-policy
            - allkeys-lru
            - --appendonly
            - "yes"
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
          env:
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-credentials
                  key: REDIS_PASSWORD
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - name: redis-data
              mountPath: /data
          livenessProbe:
            exec:
              command:
                - redis-cli
                - -a
                - $(REDIS_PASSWORD)
                - ping
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            exec:
              command:
                - redis-cli
                - -a
                - $(REDIS_PASSWORD)
                - ping
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "redis-cli -a $(REDIS_PASSWORD) SHUTDOWN NOSAVE"]
        - name: redis-exporter
          image: oliver006/redis_exporter:v1.58.0
          ports:
            - name: metrics
              containerPort: 9121
              protocol: TCP
          env:
            - name: REDIS_ADDR
              value: "redis://localhost:6379"
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-credentials
                  key: REDIS_PASSWORD
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 128Mi
  volumeClaimTemplates:
    - metadata:
        name: redis-data
      spec:
        storageClassName: fast-ssd
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
---
apiVersion: v1
kind: Service
metadata:
  name: redis-headless
  namespace: data
  labels:
    app: redis
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
      protocol: TCP
  selector:
    app: redis
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: data
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
      protocol: TCP
  selector:
    app: redis

StatefulSet vs Deployment

对于数据库和缓存等有状态服务，务必使用 StatefulSet 而非 Deployment。StatefulSet 提供了以下保证：

稳定的网络标识：每个 Pod 有固定的 DNS 名称（如 postgresql-0.postgresql-headless.data.svc.cluster.local）
有序的部署和扩缩容：Pod 按序号顺序创建和删除
稳定的持久化存储：通过 volumeClaimTemplates，每个 Pod 绑定独立的 PVC，Pod 重建后自动重新挂载

四、服务间通信与配置

4.1 CoreDNS 服务发现

Kubernetes 内置的 CoreDNS 为每个 Service 自动创建 DNS 记录，服务间可以通过标准的 FQDN 进行互相访问：

记录格式	示例	作用域
`<service>`	`user-service`	同命名空间
`<service>.<namespace>`	`user-service.microservices`	跨命名空间
`<service>.<namespace>.svc.cluster.local`	`user-service.microservices.svc.cluster.local`	全集群

最佳实践

始终使用完整的跨命名空间 FQDN（如 user-service.microservices.svc.cluster.local），即使服务在同一个命名空间中。这样在后续调整命名空间划分时，不需要修改代码和配置。

4.2 健康检查设计原则

健康检查是保障服务可用性的关键机制。K8s 提供了三种探针：

探针类型	用途	失败后果
Liveness Probe	检测容器是否存活	重启容器
Readiness Probe	检测是否可以接收流量	从 Service Endpoints 中移除
Startup Probe	检测应用是否启动完成	启动完成前禁用其他探针

以下是健康检查的配置要点：

k8s/04-communication/health-check-example.yaml
# 以订单服务为例，展示完整的健康检查配置
spec:
  containers:
    - name: order-service
      # ...其他配置...
      startupProbe:
        httpGet:
          path: /healthz
          port: 8082
        failureThreshold: 30    # 最多等待 30 * 10s = 300s
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8082
        initialDelaySeconds: 0   # startupProbe 完成后才开始
        periodSeconds: 15
        timeoutSeconds: 5
        failureThreshold: 3      # 连续 3 次失败则重启
        successThreshold: 1
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8082
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3      # 连续 3 次失败则摘除流量
        successThreshold: 1

生产环境注意事项

Liveness 和 Readiness 必须使用不同的端点。Liveness 检测"进程是否活着"，Readiness 检测"是否准备好接收请求"。如果两者使用同一端点，可能导致级联故障——例如数据库短暂不可用时，所有 Pod 同时被 Liveness 重启。
设置合理的 timeoutSeconds。过短的超时时间会导致误判，建议设置为 P99 响应时间的 2-3 倍。
对于 Java 等启动较慢的服务，务必配置 Startup Probe，否则 Liveness Probe 可能在应用启动期间就触发重启。

4.3 优雅停机

优雅停机确保 Pod 在被终止时，能够完成正在处理的请求并安全释放资源。

v1.34 新特性：原生 Sleep Lifecycle Hook

Kubernetes v1.34 引入了原生的 sleep action 用于 PreStop 和 PostStart lifecycle hooks，无需再通过 exec 执行 sleep 命令。这提供了更简洁和可靠的优雅停机方式。

详见官方文档：Container Lifecycle Hooks

推荐方式（v1.34+）：使用原生 Sleep Hook

k8s/04-communication/graceful-shutdown-v134.yaml
spec:
  terminationGracePeriodSeconds: 60  # 给予 60 秒的优雅停机时间
  containers:
    - name: order-service
      lifecycle:
        preStop:
          sleep:
            seconds: 15  # 原生 sleep action
      # ...其他配置...

传统方式（v1.34 之前）：使用 Exec Hook

k8s/04-communication/graceful-shutdown-legacy.yaml
spec:
  terminationGracePeriodSeconds: 60  # 给予 60 秒的优雅停机时间
  containers:
    - name: order-service
      lifecycle:
        preStop:
          exec:
            # 先等待 15 秒，让 Service Endpoints 更新
            # 然后发送 SIGTERM 信号，应用开始优雅关闭
            command: ["/bin/sh", "-c", "sleep 15"]
      # ...其他配置...

优雅停机的完整流程如下：

K8s 向 Pod 发送 SIGTERM 信号
preStop Hook 执行 sleep 15（或原生 sleep action），等待 15 秒（让 Ingress/Service 感知到 Pod 不可用）
应用收到 SIGTERM 后，停止接收新请求，处理完正在进行的请求
应用关闭数据库连接池、释放资源
如果超过 terminationGracePeriodSeconds（60s）仍未退出，K8s 发送 SIGKILL 强制终止

为什么 preStop 需要 sleep？

K8s 在执行 preStop Hook 的同时，会从 Service Endpoints 中移除该 Pod。但 Ingress Controller 和上游代理可能还有缓存，需要一定时间才能感知到变化。sleep 15 确保在应用真正开始关闭之前，所有上游组件都已经停止向该 Pod 转发流量。

五、Ingress 七层路由配置

5.1 Ingress Controller 部署

我们使用 APISIX Ingress Controller 作为七层路由组件。

API 版本说明：Ingress 和 IngressClass 使用 networking.k8s.io/v1，这是 Kubernetes v1.34 中的稳定版本。networking.k8s.io/v1beta1 已在 v1.22 中移除，请确保使用 v1。详见官方文档：Ingress v1

k8s/05-ingress/ingress-controller.yaml
---
# RBAC 配置
# apiVersion: rbac.authorization.k8s.io/v1 是 Kubernetes v1.34 中的稳定版本
# 详见官方文档：https://kubernetes.io/docs/reference/kubernetes-api/authorization-resources/role-v1/
apiVersion: v1
kind: ServiceAccount
metadata:
  name: apisix-ingress-controller
  namespace: gateway
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: apisix-ingress-controller
rules:
  - apiGroups: [""]
    resources: ["secrets", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses", "ingressclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apisix.apache.org"]
    resources: ["apisixroutes", "apisixupstreams", "apisixtlsconfigs", "apisixclusters", "apisixpluginconfigs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: apisix-ingress-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: apisix-ingress-controller
subjects:
  - kind: ServiceAccount
    name: apisix-ingress-controller
    namespace: gateway
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apisix-ingress-controller
  namespace: gateway
  labels:
    app: apisix-ingress-controller
spec:
  replicas: 2
  selector:
    matchLabels:
      app: apisix-ingress-controller
  template:
    metadata:
      labels:
        app: apisix-ingress-controller
    spec:
      serviceAccountName: apisix-ingress-controller
      containers:
        - name: apisix-ingress-controller
          image: apache/apisix-ingress-controller:1.8.0
          args:
            - --ingress-class
            - apisix
            - --apisix-admin-api-version
            - v3
            - --log-level
            - info
            - --http-port
            - "8080"
          env:
            - name: APISIX_ADMIN_API_URL
              value: "http://apisix-admin.gateway.svc.cluster.local:9180/apisix/admin"
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

5.2 Ingress 路由规则

k8s/05-ingress/ingress-routes.yaml
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: apisix
spec:
  controller: apache.org/apisix-ingress-controller
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ecommerce-ingress
  namespace: gateway
  annotations:
    kubernetes.io/ingress.class: apisix
    # 启用 CORS
    apisix.apache.org/enable-cors: "true"
    apisix.apache.org/cors-allow-origin: "https://shop.example.com"
    apisix.apache.org/cors-allow-methods: "GET,POST,PUT,DELETE,OPTIONS"
    apisix.apache.org/cors-allow-headers: "Authorization,Content-Type"
    # 全局限流
    apisix.apache.org/plugin-limit-count: |
      {
        "count": 1000,
        "time_window": 1,
        "rejected_code": 429,
        "key": "remote_addr"
      }
spec:
  ingressClassName: apisix
  tls:
    - hosts:
        - api.example.com
      secretName: tls-secret
  rules:
    - host: api.example.com
      http:
        paths:
          # 用户服务路由
          - path: /api/v1/users
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80
          # 商品服务路由
          - path: /api/v1/products
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80
          # 订单服务路由
          - path: /api/v1/orders
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80
          # 支付服务路由
          - path: /api/v1/payments
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80

路由设计说明

在上述配置中，所有业务路由都指向 API Gateway，由 Gateway 负责将请求转发到具体的后端服务。这种设计的好处是：

Gateway 统一处理认证、限流、熔断等横切关注点
后端服务不需要暴露到 Ingress 层
路由规则变更只需修改 Gateway 配置，无需修改 Ingress

六、弹性伸缩与高可用

6.1 HPA 弹性伸缩

Horizontal Pod Autoscaler 根据监控指标自动调整 Pod 副本数，是应对流量波动的核心机制。

API 版本说明：HPA 使用 autoscaling/v2，这是 Kubernetes v1.34 中的稳定版本，支持资源指标（CPU/内存）、Pods 指标和外部指标，以及 behavior 字段精细控制扩缩容行为。详见官方文档：HorizontalPodAutoscaler v2

k8s/06-scalability/hpa.yaml
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
  namespace: microservices
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    # CPU 利用率目标
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # 内存利用率目标
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # 扩容稳定窗口
      policies:
        - type: Pods
          value: 4                       # 每次最多扩容 4 个 Pod
          periodSeconds: 60
        - type: Percent
          value: 100                     # 或每次扩容当前副本数的 100%
          periodSeconds: 60
      selectPolicy: Max                  # 取两个策略中更激进的
    scaleDown:
      stabilizationWindowSeconds: 300   # 缩容稳定窗口 5 分钟
      policies:
        - type: Pods
          value: 2                       # 每次最多缩容 2 个 Pod
          periodSeconds: 120
      selectPolicy: Min                  # 取两个策略中更保守的
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: microservices
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    # 自定义指标：HTTP 请求延迟 P99
    - type: Pods
      pods:
        metric:
          name: http_request_duration_seconds_p99
        target:
          type: AverageValue
          averageValue: "500m"  # 500ms
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 6
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120

生产环境注意事项

扩容要快，缩容要慢：scaleUp.stabilizationWindowSeconds 应设置较小值（30-60s），scaleDown.stabilizationWindowSeconds 应设置较大值（300-600s），避免因流量短暂下降导致频繁缩容。
设置合理的 minReplicas：最小副本数不应低于 2，且应通过 Pod 反亲和性分布在不同节点上，确保单节点故障不影响服务可用性。
自定义指标需要安装 Metrics Server 和 Prometheus Adapter，否则 HPA 无法获取自定义指标。

6.2 Pod 反亲和性与 PDB

API 版本说明：PodDisruptionBudget 使用 policy/v1，这是 Kubernetes v1.34 中的稳定版本。PDB 新增了 unhealthyPodEvictionPolicy 字段，支持 IfHealthyBudget 和 AlwaysAllow 两种策略，更灵活地控制不健康 Pod 的驱逐行为。详见官方文档：PodDisruptionBudget v1

k8s/06-scalability/pdb.yaml
---
# 确保用户服务至少有 2 个可用 Pod
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: user-service-pdb
  namespace: microservices
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: user-service
---
# 确保订单服务至少有 50% 的 Pod 可用
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
  namespace: microservices
spec:
  maxUnavailable: "50%"
  selector:
    matchLabels:
      app: order-service
---
# 确保支付服务至少有 1 个可用 Pod
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: microservices
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: payment-service

PDB 与节点维护

当需要对集群节点进行维护（如升级 K8s 版本、更换硬件）时，PDB 确保驱逐操作不会导致服务可用副本数低于阈值。如果没有 PDB，kubectl drain 可能会一次性驱逐所有 Pod，导致服务中断。

七、监控与日志集成

7.1 Prometheus ServiceMonitor

API 版本说明：ServiceMonitor 使用 monitoring.coreos.com/v1，这是 Prometheus Operator 提供的 CRD，用于定义 Service 的监控目标。详见官方文档：ServiceMonitor CRD

k8s/07-monitoring/servicemonitor.yaml
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: microservices-monitor
  namespace: microservices
  labels:
    release: prometheus  # 匹配 Prometheus Operator 的 serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - microservices
      - gateway
  selector:
    matchLabels:
      app.kubernetes.io/part-of: ecommerce
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
      honorLabels: true
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: data-services-monitor
  namespace: data
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - data
  selector:
    matchLabels:
      app.kubernetes.io/part-of: ecommerce
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

7.2 Grafana Dashboard 配置

k8s/07-monitoring/grafana-dashboard.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ecommerce-grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  microservices-overview.json: |
    {
      "dashboard": {
        "title": "电商微服务总览",
        "panels": [
          {
            "title": "请求 QPS",
            "type": "timeseries",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{namespace=\"microservices\"}[5m])) by (app)"
              }
            ]
          },
          {
            "title": "P99 延迟",
            "type": "timeseries",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace=\"microservices\"}[5m])) by (le, app))"
              }
            ]
          },
          {
            "title": "错误率",
            "type": "timeseries",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{namespace=\"microservices\",status=~\"5..\"}[5m])) by (app) / sum(rate(http_requests_total{namespace=\"microservices\"}[5m])) by (app) * 100"
              }
            ]
          },
          {
            "title": "Pod 副本数",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(kube_deployment_status_replicas_available{namespace=\"microservices\"}) by (deployment)"
              }
            ]
          }
        ]
      }
    }

7.3 告警规则

k8s/07-monitoring/alerting-rules.yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ecommerce-alerts
  namespace: microservices
  labels:
    release: prometheus
spec:
  groups:
    - name: microservices.alerts
      rules:
        # 服务不可用告警
        - alert: ServiceDown
          expr: up{namespace="microservices"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "服务 {{ $labels.app }} 不可用"
            description: "{{ $labels.namespace }} 命名空间中的 {{ $labels.instance }} 已下线超过 2 分钟"

        # 高错误率告警
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="microservices",status=~"5.."}[5m])) by (app)
            / sum(rate(http_requests_total{namespace="microservices"}[5m])) by (app) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "服务 {{ $labels.app }} 错误率过高"
            description: "{{ $labels.app }} 的 5xx 错误率已超过 5%，当前值：{{ $value | humanizePercentage }}"

        # P99 延迟告警
        - alert: HighLatencyP99
          expr: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace="microservices"}[5m])) by (le, app)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "服务 {{ $labels.app }} P99 延迟过高"
            description: "{{ $labels.app }} 的 P99 延迟已超过 1 秒，当前值：{{ $value }}s"

        # Pod 重启告警
        - alert: PodRestarting
          expr: increase(kube_pod_container_status_restarts_total{namespace="microservices"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} 频繁重启"
            description: "{{ $labels.namespace }} 中的 {{ $labels.pod }} 在过去 1 小时内重启了 {{ $value }} 次"

        # HPA 达到上限告警
        - alert: HPAAtMaxReplicas
          expr: kube_hpa_status_current_replicas == kube_hpa_status_max_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.hpa }} 已达到最大副本数"
            description: "{{ $labels.namespace }} 中的 {{ $labels.hpa }} 已达到最大副本数 {{ $value }}，可能需要调整上限"

7.4 Fluent Bit 日志收集

说明：Fluent Bit 以 DaemonSet 方式部署，确保每个节点运行一个日志采集代理。RBAC 配置使用 rbac.authorization.k8s.io/v1，DaemonSet 使用 apps/v1。详见官方文档：Fluent Bit Kubernetes Filter

k8s/07-monitoring/fluent-bit.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: monitoring
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: monitoring
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.0.0
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc/
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: monitoring
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Daemon       Off
        Log_Level    info
        Parsers_File parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  10
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Parser.On       On
        K8S-Parser.Exclude On

    [OUTPUT]
        Name              elasticsearch
        Match             kube.*
        Host              elasticsearch.monitoring.svc.cluster.local
        Port              9200
        Index             ecommerce-logs
        Type              _doc
        Logstash_Format   On
        Logstash_Prefix   ecommerce
        Retry_Limit       False

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L

八、CI/CD 集成（ArgoCD）

8.1 GitOps 工作流

GitOps 的核心理念是：Git 仓库是唯一的事实来源（Single Source of Truth）。所有环境变更都通过提交代码来触发，ArgoCD 负责将 Git 仓库中的声明式配置同步到 Kubernetes 集群。

8.2 ArgoCD Application 配置

API 版本说明：ArgoCD 使用 argoproj.io/v1alpha1，这是 ArgoCD 的稳定 API 版本。AppProject、Application 和 ApplicationSet 均使用此版本。详见官方文档：ArgoCD CRD Reference

k8s/08-argocd/argocd-app.yaml
---
# ArgoCD 项目定义
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ecommerce
  namespace: argocd
spec:
  description: "电商微服务项目"
  sourceRepos:
    - "https://github.com/your-org/ecommerce-k8s.git"
  destinations:
    - namespace: microservices
      server: https://kubernetes.default.svc
    - namespace: data
      server: https://kubernetes.default.svc
    - namespace: gateway
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
    - group: "networking.k8s.io"
      kind: IngressClass
  orphanedResources:
    warn: true
---
# ArgoCD 应用定义
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ecommerce-infra
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: ecommerce
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: ecommerce
  source:
    repoURL: https://github.com/your-org/ecommerce-k8s.git
    targetRevision: main
    path: k8s
    directory:
      recurse: true
      jsonnet: false
  destination:
    server: https://kubernetes.default.svc
    namespace: microservices
  syncPolicy:
    automated:
      prune: true       # 自动删除 Git 中不存在的资源
      selfHeal: true    # 自动修复手动变更
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
      - ServerSideApply=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
---
# ArgoCD 应用集（App of Apps 模式）
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ecommerce-appset
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/your-org/ecommerce-k8s.git
        revision: main
        directories:
          - path: k8s/*
  template:
    metadata:
      name: "{{ path.basename }}"
    spec:
      project: ecommerce
      source:
        repoURL: https://github.com/your-org/ecommerce-k8s.git
        targetRevision: main
        path: "{{ path }}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{ path.basename }}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

生产环境注意事项

selfHeal: true 要谨慎使用。开启后，任何手动通过 kubectl 修改的配置都会被 ArgoCD 自动覆盖。建议在开发/测试环境开启，生产环境使用手动同步。
prune: true 会删除 Git 中不存在的资源。误删 Git 中的文件可能导致生产环境资源被意外删除。建议配合 syncOptions: PruneLast=true，让 ArgoCD 最后再执行删除操作。
使用 App of Apps 模式管理多应用。通过 ApplicationSet 可以自动发现 Git 仓库中的目录结构，为每个子目录创建一个 Application，避免手动维护大量 Application 资源。

九、请求完整链路

下面展示一个用户下单请求的完整链路，帮助你理解各组件之间的协作关系。

十、部署与验证

10.1 分步部署

按照依赖关系，从底层到上层依次部署：

# 1. 创建命名空间
kubectl apply -f k8s/00-namespace/

# 2. 部署配置与密钥
kubectl apply -f k8s/01-config/

# 3. 创建存储资源
kubectl apply -f k8s/02-storage/

# 4. 部署数据层（等待 PostgreSQL 和 Redis 就绪）
kubectl apply -f k8s/03-services/postgresql-statefulset.yaml
kubectl apply -f k8s/03-services/redis-statefulset.yaml

# 等待数据层就绪
kubectl wait --for=condition=ready pod \
  -l app=postgresql -n data --timeout=120s
kubectl wait --for=condition=ready pod \
  -l app=redis -n data --timeout=120s

# 5. 部署业务服务
kubectl apply -f k8s/03-services/user-service.yaml
kubectl apply -f k8s/03-services/product-service.yaml
kubectl apply -f k8s/03-services/order-service.yaml
kubectl apply -f k8s/03-services/payment-service.yaml

# 6. 部署网关
kubectl apply -f k8s/03-services/gateway-deployment.yaml

# 7. 部署 Ingress
kubectl apply -f k8s/05-ingress/

# 8. 部署弹性伸缩与高可用
kubectl apply -f k8s/06-scalability/

# 9. 部署监控
kubectl apply -f k8s/07-monitoring/

10.2 健康检查验证

# 检查所有命名空间下的 Pod 状态
kubectl get pods --all-namespaces -l app.kubernetes.io/part-of=ecommerce

# 检查各服务 Endpoints
kubectl get endpoints -n microservices
kubectl get endpoints -n data
kubectl get endpoints -n gateway

# 检查 Ingress 状态
kubectl get ingress -n gateway
kubectl describe ingress ecommerce-ingress -n gateway

# 检查 HPA 状态
kubectl get hpa -n microservices
kubectl describe hpa user-service-hpa -n microservices

# 检查 PDB 状态
kubectl get pdb -n microservices

# 检查 PVC 绑定状态
kubectl get pvc -n data

# 检查 ArgoCD 应用状态
kubectl get applications -n argocd

10.3 压力测试

使用 hey 工具对用户服务进行压力测试：

# 进入集群内部执行测试（或通过 port-forward）
kubectl run hey-test --image=williamyeh/hey:latest --rm -it --restart=Never -- \
  -n 100 -c 20 -m POST \
  -H "Content-Type: application/json" \
  -d '{"username":"testuser","password":"testpass"}' \
  http://user-service.microservices.svc.cluster.local:8080/api/v1/users/login

# 模拟并发下单请求
kubectl run hey-test --image=williamyeh/hey:latest --rm -it --restart=Never -- \
  -n 500 -c 50 -m POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"product_id":1,"quantity":2}' \
  http://order-service.microservices.svc.cluster.local:8082/api/v1/orders

# 观察 HPA 是否触发扩容
kubectl get hpa -n microservices -w

10.4 故障注入测试

# 测试 1：删除 Pod，验证自动恢复
kubectl delete pod -l app=user-service -n microservices --grace-period=0 --force
# 观察 Pod 是否自动重建
kubectl get pods -l app=user-service -n microservices -w

# 测试 2：验证优雅停机（不应出现 5xx 错误）
# 在另一个终端持续发送请求
kubectl run curl-test --image=curlimages/curl:latest --rm -it --restart=Never -- \
  -s -o /dev/null -w "%{http_code}\n" \
  http://user-service.microservices.svc.cluster.local:8080/healthz

# 测试 3：模拟节点故障（需要多节点集群）
kubectl cordon <node-name>        # 标记节点不可调度
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 观察 Pod 是否迁移到其他节点
kubectl get pods -l app=user-service -n microservices -o wide -w

# 测试 4：验证 PDB 保护
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 如果违反 PDB，drain 会被阻止，并提示：
# "error: poddisruptionbudgets policy violation"

十一、生产环境最佳实践总结

经过上述完整的实战演练，以下是生产环境中的关键最佳实践：

架构设计

实践	说明
命名空间隔离	按团队/环境/业务域划分命名空间，配合 ResourceQuota 限制资源
配置外部化	使用 ConfigMap/Secret 管理配置，禁止将配置硬编码在镜像中
密钥加密	使用 Sealed Secrets 或 External Secrets Operator 管理敏感信息
服务网格可选	对于服务间通信复杂度高的场景，考虑引入 Istio/Linkerd

部署策略

实践	说明
滚动更新	`maxSurge: 1, maxUnavailable: 0` 确保更新过程中不中断服务
健康检查三件套	Startup + Liveness + Readiness Probe，使用不同端点
优雅停机	`preStop sleep` + `terminationGracePeriodSeconds` 确保请求处理完成
Pod 反亲和性	确保同一服务的 Pod 分布在不同节点上

弹性伸缩

实践	说明
HPA 多指标	同时关注 CPU、内存和自定义业务指标（如 QPS、延迟）
扩快缩慢	扩容窗口短（30-60s），缩容窗口长（300-600s）
PDB 保护	为关键服务配置 PDB，防止维护操作导致服务不可用
Cluster Autoscaler	配合节点自动伸缩，当 Pod 因资源不足处于 Pending 状态时自动扩容节点

可观测性

实践	说明
三大支柱	Metrics（Prometheus）+ Logs（Fluent Bit）+ Traces（Jaeger/OpenTelemetry）
告警分级	Critical（立即响应）+ Warning（工作时间内处理）+ Info（记录备案）
Dashboard 分层	全局总览 -> 服务维度 -> Pod 维度，逐层下钻
SLO/SLI	定义明确的服务质量目标（如 99.9% 可用性、P99 < 500ms）

持续交付

实践	说明
GitOps	Git 仓库作为唯一事实来源，所有变更通过 PR 审批
环境隔离	dev -> staging -> production 逐级发布，每级有独立的 ArgoCD Application
镜像标签策略	使用 Git Commit SHA 作为镜像 Tag，确保可追溯性
回滚机制	ArgoCD 支持一键回滚到 Git 历史中的任意版本

总结

生产级 Kubernetes 部署不仅仅是"把 YAML 写对"，更是一套涵盖架构设计、部署策略、弹性伸缩、可观测性和持续交付的完整体系。本文通过电商微服务的实战案例，展示了如何将这些最佳实践落地到具体的 YAML 配置中。

希望这篇实战指南能够帮助你在实际项目中少走弯路，构建出真正可靠、可扩展的云原生微服务系统。

Kubernetes 全景解析 (4)：调度器、资源管理与弹性伸缩

2026年4月9日 · 阅读需 21 分钟

Rainy

雨落无声，代码成诗 —— 致力于技术与艺术的极致平衡

前言

在 Kubernetes 集群中，调度器（Scheduler）扮演着"决策中枢"的角色——每一个 Pod 应该运行在哪个节点上，都由它来决定。而资源管理（Resource Management）与弹性伸缩（Autoscaling）则是保障集群高效、稳定运行的关键机制。这三者共同构成了 K8s 资源层的核心能力。

本文将从资源模型出发，深入剖析 kube-scheduler 的调度机制，全面讲解 HPA/VPA/Cluster Autoscaler 三级弹性伸缩策略，并介绍 ResourceQuota、LimitRange、PriorityClass 等资源治理工具，帮助你构建一套完整的 K8s 资源管理知识体系。

一、资源模型基础

1.1 Kubernetes 可管理资源类型

Kubernetes 对容器可以使用的资源进行了抽象，目前支持以下几种核心资源类型：

资源类型	缩写	说明	可压缩
CPU	`cpu`	计算资源，以核心（core）为单位，支持小数（如 `0.5` = 500m）	是
Memory	`memory`	内存资源，以字节为单位（支持 Ki/Mi/Gi）	否
Ephemeral Storage	`ephemeral-storage`	临时存储（日志、EmptyDir 等），以字节为单位	否
Extended Resources	如 `nvidia.com/gpu`	扩展资源（如 GPU、Infiniband），由设备插件注册	否

可压缩 vs 不可压缩资源

可压缩资源（Compressible）：CPU 是可压缩的。当 Pod 超过 CPU Limit 时，不会被杀掉，而是被限流（throttled），表现为性能下降。
不可压缩资源（Incompressible）：内存和存储是不可压缩的。当 Pod 超过 Memory Limit 时，会被 OOM Killer 杀掉并重启。

理解这一区别对于合理设置资源配额至关重要。

1.2 Request vs Limit

Kubernetes 通过 requests 和 limits 两个字段来控制容器的资源使用：

Request（请求量）：调度器依据此值决定将 Pod 调度到哪个节点（保证最低可用资源）
Limit（限制量）：运行时容器可使用的资源上限（硬限制）

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "250m"       # 0.25 核，调度依据
        memory: "256Mi"   # 256 MiB，调度依据
      limits:
        cpu: "500m"       # 0.5 核，运行时上限
        memory: "512Mi"   # 512 MiB，运行时上限，超出则 OOM Kill

Limit 可以省略吗？

如果只设置了 requests 而没有设置 limits，在默认情况下（未配置 LimitRange），limits 默认等于节点可分配资源上限，这意味着容器可以无限使用资源。在生产环境中，强烈建议同时设置 requests 和 limits。

1.3 QoS 服务质量类别

Kubernetes 根据资源配额的设置方式，将 Pod 分为三个 QoS（Quality of Service）等级。当节点资源不足时，K8s 会优先驱逐低 QoS 的 Pod。

三个 QoS 等级的详细对比：

QoS 等级	条件	CPU 行为	内存行为	驱逐优先级
Guaranteed	所有容器 Request == Limit（CPU + Memory）	稳定，不会被限流	超限则 OOM Kill	最低（最后驱逐）
Burstable	至少一个容器设置了 Request 或 Limit（但不满足 Guaranteed）	可能被限流	超限则 OOM Kill	中等
BestEffort	所有容器均未设置 Request 和 Limit	最先被限流	最先被 OOM Kill	最高（最先驱逐）

生产环境建议

对于核心业务（如支付、订单），建议设置为 Guaranteed 级别，确保资源独占；对于一般业务（如日志处理），设置为 Burstable 即可；BestEffort 仅适用于测试或临时任务。

二、kube-scheduler：调度器深度解析

2.1 调度流程概览

kube-scheduler 是 Kubernetes 的核心组件之一，负责将新创建的、未调度的 Pod 分配到合适的节点上。根据官方文档，kube-scheduler 通过**两步操作（2-step operation）**完成节点选择：

kube-scheduler selects a node for the pod in a 2-step operation: 1. Filtering 2. Scoring —— Kubernetes Scheduler 官方文档

Scheduling Profiles vs Scheduling Policies

官方文档指出，有两种方式配置调度器的过滤和打分行为：

Scheduling Profiles（推荐）：通过配置插件（Plugins）实现不同调度阶段，包括 QueueSort、Filter、Score、Bind、Reserve、Permit 等扩展点。
Scheduling Policies（旧版方式）：通过配置 Predicates（过滤）和 Priorities（打分）来定义调度策略。

2.2 过滤阶段（Filtering）

过滤阶段的目标是快速排除不满足条件的节点，缩小候选范围。通过过滤的节点称为可行节点（Feasible Nodes）。以下是常见的过滤策略：

过滤策略	说明	配置方式
NodeName	直接指定节点名称，跳过所有调度逻辑	`spec.nodeName`
nodeSelector	基于节点标签的简单匹配	`spec.nodeSelector`
nodeAffinity	基于节点标签的高级匹配（支持软约束）	`spec.affinity.nodeAffinity`
podAffinity	Pod 亲和性，倾向于与某些 Pod 部署在一起	`spec.affinity.podAffinity`
podAntiAffinity	Pod 反亲和性，倾向于远离某些 Pod	`spec.affinity.podAntiAffinity`
Taint/Toleration	节点污点与 Pod 容忍度匹配	`spec.tolerations`
资源检查	节点剩余资源是否满足 Pod 的 Request	自动
Volume 检查	节点是否能挂载 Pod 所需的存储卷	自动

2.3 打分阶段（Scoring）

经过过滤后，调度器对剩余的候选节点进行打分，选择得分最高的节点。常见的打分策略包括：

打分策略	权重	说明
NodeResourcesFit	默认 1	资源均衡分配，倾向于选择资源使用率更均衡的节点
NodeAffinity	默认 1	满足节点亲和性软约束的节点获得加分
PodAffinity	默认 1	满足 Pod 亲和性软约束的节点获得加分
ImageLocality	默认 1	节点上已存在 Pod 所需镜像时加分（减少拉取时间）
TaintToleration	默认 1	容忍节点污点的 Pod 获得加分
InterPodAffinity	默认 1	Pod 间亲和/反亲和性评分
VolumeBinding	默认 1	PV/PVC 绑定评分

自定义调度器与 Scheduler Framework

Kubernetes 允许你编写自定义调度器（Custom Scheduler），通过实现 Scheduler Framework 的扩展点（Extension Points）来插入自定义逻辑。Scheduling Profiles 配置中支持的插件阶段包括：QueueSort、Filter、Score、Bind、Reserve、Permit、PreBind、PostBind 等。更多详情请参阅 kube-scheduler 配置参考。

2.4 nodeSelector：简单节点选择

nodeSelector 是最简单的节点选择方式，通过标签匹配来约束 Pod 的调度目标：

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    gpu: "nvidia-tesla-t4"    # 只调度到具有此标签的节点
    disktype: "ssd"
  containers:
  - name: cuda-app
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: "1"    # 申请 1 块 GPU

2.5 nodeAffinity：高级节点亲和性

nodeAffinity 支持更灵活的匹配规则，包括硬约束（required） 和 软约束（preferred）：

apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
spec:
  affinity:
    nodeAffinity:
      # 硬约束：必须满足，否则 Pod 无法调度
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]
          - key: node-type
            operator: NotIn
            values: ["spot-instance"]
      # 软约束：尽量满足，不满足也能调度
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node-role.kubernetes.io/worker
            operator: In
            values: ["high-memory"]
  containers:
  - name: app
    image: nginx:1.25

操作符说明

nodeAffinity 支持的操作符：In、NotIn、Exists、DoesNotExist、Gt、Lt。其中 Gt 和 Lt 仅用于数值比较。

2.6 podAffinity 与 podAntiAffinity

Pod 亲和性用于控制 Pod 之间的部署位置关系：

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        # Pod 亲和性：与 cache Pod 部署在同一可用区
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: cache
            topologyKey: topology.kubernetes.io/zone
        # Pod 反亲和性：不同副本尽量分布在不同节点
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.25
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"

topologyKey 的重要性

topologyKey 是亲和性规则的关键字段，它定义了拓扑域的划分维度。常用的值包括：

kubernetes.io/hostname：节点级别
topology.kubernetes.io/zone：可用区级别
topology.kubernetes.io/region：区域级别

选择合适的 topologyKey 对于实现高可用部署至关重要。

2.7 Taint 与 Toleration：污点容忍机制

Taint（污点）是作用于节点上的标记，用于排斥不容忍该污点的 Pod。Toleration（容忍度）是 Pod 上的声明，表示可以容忍特定的污点。

# 给节点添加污点
# kubectl taint nodes node1 dedicated=gpu:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  # 声明容忍度
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: cuda-app
    image: nvidia/cuda:12.0-base

Taint 的 effect 类型：

Effect	说明
NoSchedule	不调度新的不容忍 Pod（已运行的 Pod 不受影响）
PreferNoSchedule	尽量不调度（软约束）
NoExecute	不调度，且驱逐已运行的不容忍 Pod

常见内置 Taint

Kubernetes 自动为特定场景的节点添加 Taint：

node.kubernetes.io/not-ready：节点未就绪
node.kubernetes.io/unreachable：节点不可达
node.kubernetes.io/memory-pressure：内存压力
node.kubernetes.io/disk-pressure：磁盘压力
node.kubernetes.io/network-unavailable：网络不可用
node.kubernetes.io/unschedulable：节点被标记为不可调度（cordon）

三、HPA：水平 Pod 自动伸缩

3.1 HPA 工作原理

Horizontal Pod Autoscaler（HPA）通过监控 Pod 的资源使用指标，自动调整 Deployment/ReplicaSet/StatefulSet 的副本数量，使应用能够根据负载变化自动扩缩。

3.2 扩缩算法详解

HPA 的核心算法如下：

期望副本数 = ceil(当前副本数 × (当前指标值 / 目标指标值))

算法示例

假设 Deployment 当前有 4 个副本，目标 CPU 利用率为 50%：

当前平均 CPU 利用率为 80% → 期望副本数 = ceil(4 × 80/50) = ceil(6.4) = 7
当前平均 CPU 利用率为 20% → 期望副本数 = ceil(4 × 20/50) = ceil(1.6) = 2

注意：HPA 不会将副本数缩减到低于 spec.replicas.minReplicas（默认为 1）。

3.3 指标类型

HPA 支持四种指标类型：

指标类型	数据来源	示例
Resource	Metrics Server（CPU/内存）	CPU 利用率 50%
Container Resource	Metrics Server（容器级别）	容器内存使用量
Object	自定义指标（如 Ingress QPS）	每秒请求数
External	外部指标系统（如 Prometheus）	Kafka 队列积压消息数

3.4 行为配置

从 Kubernetes 1.18 开始，HPA 引入了 behavior 字段，允许精细控制扩缩行为：

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: "500Mi"
  behavior:
    # 缩容冷却窗口与策略
    scaleDown:
      stabilizationWindowSeconds: 300   # 5 分钟内不重复缩容
      policies:
      - type: Percent
        value: 10                        # 每次最多缩容 10%
        periodSeconds: 60                # 每 60 秒允许一次缩容
      - type: Pods
        value: 2                         # 或每次最多缩容 2 个 Pod
        periodSeconds: 60
      selectPolicy: Min                  # 取两种策略中更保守的
    # 扩容策略
    scaleUp:
      stabilizationWindowSeconds: 0      # 扩容不需要冷却
      policies:
      - type: Percent
        value: 100                       # 允许一次扩容 100%
        periodSeconds: 15                # 每 15 秒允许一次扩容
      - type: Pods
        value: 4                         # 或每次最多扩容 4 个 Pod
        periodSeconds: 15
      selectPolicy: Max                  # 取两种策略中更激进的

扩缩行为建议

扩容应该快：用户不希望等待太久，建议设置较短的 periodSeconds（如 15s）
缩容应该慢：避免频繁缩容导致的服务抖动，建议设置较长的 stabilizationWindowSeconds（如 300s）
使用 selectPolicy：扩容选 Max（激进），缩容选 Min（保守）

四、VPA：垂直 Pod 自动伸缩

4.1 VPA 工作模式

Vertical Pod Autoscaler（VPA）通过调整 Pod 的 CPU 和内存 Request/Limit 来实现垂直伸缩。VPA 支持三种工作模式：

模式	说明	适用场景
Auto	自动更新 Pod 的资源配额（会重启 Pod）	非关键业务，可容忍重启
Recreate	自动更新，且在更新时重启 Pod	同 Auto
Off	仅提供建议，不执行变更	评估阶段，收集数据

4.2 VPA 与 HPA 的协作与冲突

VPA 与 HPA 的冲突

VPA 和 HPA 不能同时基于 CPU/内存指标工作。原因很简单：HPA 通过增加副本数来降低单个 Pod 的资源使用率，而 VPA 通过增加单个 Pod 的资源配额来满足需求。两者同时基于 CPU/内存工作会导致"扩容循环"——HPA 增加 Pod 数量导致单个 Pod 负载降低，VPA 随之降低资源配额，然后 HPA 又需要更多 Pod 来满足需求。

解决方案：HPA 基于 CPU/内存指标伸缩，VPA 仅基于自定义指标（如内存使用量）工作；或者使用 HPA 基于 CPU 伸缩，VPA 仅调整内存。

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"       # Auto / Recreate / Off
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"
      controlledResources: ["cpu", "memory"]
  recommenders:
  - name: custom-recommender  # 可自定义推荐器

五、Cluster Autoscaler：节点级伸缩

5.1 CA 工作原理

Cluster Autoscaler（CA）负责在集群级别自动调整节点数量。当 Pod 因资源不足而无法调度（Pending）时，CA 会自动添加新节点；当节点利用率持续偏低时，CA 会自动移除空闲节点。

CA 的核心工作逻辑：

扩容触发：检测到有 Pod 处于 Pending 状态，且原因是资源不足
评估节点组：根据 Pod 的资源需求、节点亲和性等约束，选择合适的节点组
创建节点：调用云厂商 API 创建新节点
缩容触发：节点利用率持续低于阈值（默认 50%），且节点上的 Pod 可以被迁移到其他节点
驱逐 Pod：安全驱逐节点上的 Pod（非 Critical Pod、非 DaemonSet Pod）
删除节点：调用云厂商 API 删除节点

5.2 CA 与 HPA 的配合

CA 和 HPA 的配合构成了 Kubernetes 的两级弹性伸缩体系：

层级	组件	伸缩维度	响应速度	粒度
第一级	HPA	Pod 数量	秒级	细粒度
第二级	Cluster Autoscaler	节点数量	分钟级	粗粒度

典型工作流程：负载增加 → HPA 增加 Pod 副本 → 节点资源不足 → Pod Pending → CA 添加新节点 → 新 Pod 被调度到新节点。

5.3 云厂商集成

CA 需要与云厂商的 API 集成来管理节点池：

云厂商	节点组实现	CA 配置
AWS	ASG（Auto Scaling Group） / EKS Managed Node Group	`--nodegroup-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled`
GCP	MIG（Managed Instance Group） / GKE Node Pool	自动发现
Azure	VMSS（Virtual Machine Scale Set） / AKS Node Pool	`--nodegroup-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled`
阿里云	ECI / ACK 节点池	`--nodegroup-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled`

CA 的安全限制

Cluster Autoscaler 不会驱逐以下 Pod：

通过 PDB（PodDisruptionBudget）保护的 Pod（驱逐会违反最小可用数）
没有被 Controller 管理的裸 Pod（Bare Pod）
使用本地存储的 Pod（数据会随节点删除而丢失）
在 kube-system 命名空间中且不是 DaemonSet 管理的 Pod
配置了 "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" 注解的 Pod

六、资源配额与限制

6.1 ResourceQuota：命名空间级资源配额

ResourceQuota 用于限制一个命名空间中可以创建的资源总量，防止某个团队或项目占用过多集群资源。

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    # 计算资源配额
    requests.cpu: "20"          # 该命名空间最多请求 20 核 CPU
    requests.memory: "40Gi"     # 最多请求 40Gi 内存
    limits.cpu: "40"            # 最多限制 40 核 CPU
    limits.memory: "80Gi"       # 最多限制 80Gi 内存
    # 对象数量配额
    pods: "50"                  # 最多 50 个 Pod
    services: "10"              # 最多 10 个 Service
    persistentvolumeclaims: "20" # 最多 20 个 PVC
    configmaps: "20"            # 最多 20 个 ConfigMap
    secrets: "20"               # 最多 20 个 Secret
    # 存储
    requests.storage: "100Gi"   # 最多请求 100Gi 存储

ResourceQuota 的作用范围

ResourceQuota 只限制设置了 requests 和 limits 的 Pod。如果一个 Pod 没有设置资源请求，它不会计入 ResourceQuota 的配额，但也不会被调度（除非命名空间配置了 LimitRange 来提供默认值）。

6.2 LimitRange：默认资源限制

LimitRange 用于为命名空间中的 Pod 或 Container 设置默认的资源请求和限制值，确保即使 Pod 的 YAML 中没有声明资源，也能获得合理的默认值。

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  # Container 级别默认值
  - type: Container
    default:                    # 默认 Limit
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:             # 默认 Request
      cpu: "100m"
      memory: "128Mi"
    max:                        # 最大允许值
      cpu: "2"
      memory: "2Gi"
    min:                        # 最小允许值
      cpu: "50m"
      memory: "64Mi"
  # Pod 级别限制
  - type: Pod
    max:
      cpu: "4"
      memory: "4Gi"

6.3 PriorityClass：优先级与抢占

PriorityClass 用于定义 Pod 的优先级。当集群资源不足时，高优先级的 Pod 可以抢占低优先级 Pod 的资源。

# 定义高优先级类
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
description: "核心业务 Pod，优先调度"
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority

# 定义低优先级类
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
description: "批处理任务，资源空闲时运行"
value: 100
globalDefault: false
preemptionPolicy: PreemptLowerPriority

# 使用优先级类
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "1"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "2Gi"

抢占机制

当高优先级 Pod 无法调度时，调度器会尝试驱逐低优先级 Pod 来释放资源。抢占过程分为两步：

驱逐（Eviction）：找到可以抢占的低优先级 Pod
等待（Waiting）：等待被驱逐的 Pod 优雅终止（graceful shutdown）

系统保留的 PriorityClass：

system-cluster-critical（值 2000000000）：系统组件专用
system-node-critical（值 2000001000）：节点关键组件专用（如 kube-proxy）

七、调度策略最佳实践

7.1 生产环境调度建议

场景	建议策略
核心业务	Guaranteed QoS + nodeAffinity 硬约束 + 高优先级 + PDB 保护
批处理任务	BestEffort/Burstable QoS + 低优先级 + 反亲和性避免影响在线业务
GPU 工作负载	nodeSelector + Taint/Toleration + ResourceQuota 限制 GPU 数量
高可用部署	podAntiAffinity（hostname 级别）+ topologySpreadConstraints
多可用区部署	nodeAffinity（zone 级别）+ topologySpreadConstraints

7.2 常见调度问题排查

问题现象	可能原因	排查命令
Pod 一直 Pending	资源不足 / 亲和性不满足 / Taint 不匹配	`kubectl describe pod <name>`
Pod 频繁被驱逐	节点资源压力 / QoS 等级过低	`kubectl get events --sort-by=.metadata.creationTimestamp`
节点资源不均衡	缺少反亲和性配置 / 打分权重不合理	`kubectl top nodes`
HPA 不生效	未安装 Metrics Server / 指标未就绪	`kubectl get hpa -w`
VPA 与 HPA 冲突	同时基于 CPU/内存工作	检查 VPA 和 HPA 的指标配置

排查 Pending Pod 的黄金法则

当 Pod 处于 Pending 状态时，第一步永远是执行 kubectl describe pod <name>，查看 Events 部分。调度器会在 Events 中详细记录过滤失败的原因，例如：

0/5 nodes are available: 1 Insufficient cpu, 3 node(s) had taint {node.kubernetes.io/disk-pressure:}, that the pod didn't tolerate, 1 node(s) didn't match node selector.

这条信息清晰地告诉你：1 个节点 CPU 不足，3 个节点有磁盘压力污点，1 个节点不匹配 nodeSelector。

八、官方文档参考

本文内容基于 Kubernetes v1.34 官方文档校验，以下是相关官方文档链接：

调度相关

Kubernetes Scheduler - 调度器核心概念与 Filtering/Scoring 两步操作
Assigning Pods to Nodes - nodeSelector、亲和性、拓扑分布约束
Pod Topology Spread Constraints - 拓扑分布约束
Pod Priority and Preemption - Pod 优先级与抢占
Pod Overhead - Pod 开销资源

资源管理相关

Resource Management for Pods and Containers - Request/Limit 与 QoS
Configure Pod Quality of Service - QoS 配置详解
Horizontal Pod Autoscaling - HPA 官方指南
Vertical Pod Autoscaling - VPA 官方指南
Cluster Autoscaler - 集群自动伸缩

配额与限制

Resource Quotas - 命名空间资源配额
Limit Ranges - 默认资源限制
Limit Priority Class consumption by default - 限制 PriorityClass 使用

九、本章小结

本文深入剖析了 Kubernetes 调度与资源管理的核心机制，主要知识点回顾如下：

资源模型：K8s 通过 Request/Limit 控制容器的资源使用，QoS 等级（Guaranteed/Burstable/BestEffort）决定了资源不足时的驱逐顺序。
调度器：kube-scheduler 通过官方定义的"Filtering → Scoring"两步操作完成节点选择，随后通过 Binding 将 Pod 绑定到目标节点。nodeSelector/nodeAffinity/podAffinity/Taint-Toleration 提供了灵活的调度约束。
弹性伸缩：
- HPA：水平伸缩，调整副本数量，响应最快
- VPA：垂直伸缩，调整资源配额，适合优化资源利用率
- Cluster Autoscaler：节点级伸缩，配合 HPA 实现完整的弹性能力
资源治理：ResourceQuota 限制命名空间资源总量，LimitRange 提供默认资源值，PriorityClass 实现优先级与抢占。
最佳实践：核心业务使用 Guaranteed QoS + 高优先级 + PDB 保护；批处理任务使用低优先级 + 反亲和性；多可用区部署使用 topologySpreadConstraints。

掌握这些机制，你就能在生产环境中构建一个高效、稳定、弹性的 Kubernetes 资源管理体系。

一、实战场景概述​

1.1 电商微服务架构设计​

1.2 微服务架构拓扑​

1.3 技术栈选择​

二、基础设施准备​

2.1 命名空间与资源配额​

2.2 ConfigMap 与 Secret 配置管理​

2.3 PV/PVC 持久化存储规划​

三、核心服务部署​

3.1 API Gateway 部署​

3.2 用户服务部署​

3.3 订单服务部署​

3.4 商品服务部署​

3.5 支付服务部署​

3.6 数据库部署（StatefulSet）​

四、服务间通信与配置​

4.1 CoreDNS 服务发现​

4.2 健康检查设计原则​

4.3 优雅停机​

五、Ingress 七层路由配置​

5.1 Ingress Controller 部署​

5.2 Ingress 路由规则​

六、弹性伸缩与高可用​

6.1 HPA 弹性伸缩​

6.2 Pod 反亲和性与 PDB​

七、监控与日志集成​

7.1 Prometheus ServiceMonitor​

7.2 Grafana Dashboard 配置​

7.3 告警规则​

7.4 Fluent Bit 日志收集​

八、CI/CD 集成（ArgoCD）​

8.1 GitOps 工作流​

8.2 ArgoCD Application 配置​

九、请求完整链路​

十、部署与验证​

10.1 分步部署​

10.2 健康检查验证​

10.3 压力测试​

10.4 故障注入测试​

十一、生产环境最佳实践总结​

架构设计​

部署策略​

弹性伸缩​

可观测性​

持续交付​

前言​

一、资源模型基础​

1.1 Kubernetes 可管理资源类型​

1.2 Request vs Limit​

1.3 QoS 服务质量类别​

二、kube-scheduler：调度器深度解析​

2.1 调度流程概览​

2.2 过滤阶段（Filtering）​

2.3 打分阶段（Scoring）​

2.4 nodeSelector：简单节点选择​

2.5 nodeAffinity：高级节点亲和性​

2.6 podAffinity 与 podAntiAffinity​

2.7 Taint 与 Toleration：污点容忍机制​

三、HPA：水平 Pod 自动伸缩​

3.1 HPA 工作原理​

3.2 扩缩算法详解​

3.3 指标类型​

3.4 行为配置​

四、VPA：垂直 Pod 自动伸缩​

4.1 VPA 工作模式​

4.2 VPA 与 HPA 的协作与冲突​

五、Cluster Autoscaler：节点级伸缩​

5.1 CA 工作原理​

5.2 CA 与 HPA 的配合​

5.3 云厂商集成​

六、资源配额与限制​

6.1 ResourceQuota：命名空间级资源配额​

6.2 LimitRange：默认资源限制​

6.3 PriorityClass：优先级与抢占​

七、调度策略最佳实践​

7.1 生产环境调度建议​

7.2 常见调度问题排查​

八、官方文档参考​

调度相关​

资源管理相关​

一、实战场景概述

1.1 电商微服务架构设计

1.2 微服务架构拓扑

1.3 技术栈选择

二、基础设施准备

2.1 命名空间与资源配额

2.2 ConfigMap 与 Secret 配置管理

2.3 PV/PVC 持久化存储规划

三、核心服务部署

3.1 API Gateway 部署

3.2 用户服务部署

3.3 订单服务部署

3.4 商品服务部署

3.5 支付服务部署

3.6 数据库部署（StatefulSet）

四、服务间通信与配置

4.1 CoreDNS 服务发现

4.2 健康检查设计原则

4.3 优雅停机

五、Ingress 七层路由配置

5.1 Ingress Controller 部署

5.2 Ingress 路由规则

六、弹性伸缩与高可用

6.1 HPA 弹性伸缩

6.2 Pod 反亲和性与 PDB

七、监控与日志集成

7.1 Prometheus ServiceMonitor

7.2 Grafana Dashboard 配置

7.3 告警规则

7.4 Fluent Bit 日志收集

八、CI/CD 集成（ArgoCD）

8.1 GitOps 工作流

8.2 ArgoCD Application 配置

九、请求完整链路

十、部署与验证

10.1 分步部署

10.2 健康检查验证

10.3 压力测试

10.4 故障注入测试

十一、生产环境最佳实践总结

架构设计

部署策略

弹性伸缩

可观测性

持续交付

前言

一、资源模型基础

1.1 Kubernetes 可管理资源类型

1.2 Request vs Limit

1.3 QoS 服务质量类别

二、kube-scheduler：调度器深度解析

2.1 调度流程概览

2.2 过滤阶段（Filtering）

2.3 打分阶段（Scoring）

2.4 nodeSelector：简单节点选择

2.5 nodeAffinity：高级节点亲和性

2.6 podAffinity 与 podAntiAffinity

2.7 Taint 与 Toleration：污点容忍机制

三、HPA：水平 Pod 自动伸缩

3.1 HPA 工作原理

3.2 扩缩算法详解

3.3 指标类型

3.4 行为配置

四、VPA：垂直 Pod 自动伸缩

4.1 VPA 工作模式

4.2 VPA 与 HPA 的协作与冲突

五、Cluster Autoscaler：节点级伸缩

5.1 CA 工作原理

5.2 CA 与 HPA 的配合

5.3 云厂商集成

六、资源配额与限制

6.1 ResourceQuota：命名空间级资源配额

6.2 LimitRange：默认资源限制

6.3 PriorityClass：优先级与抢占

七、调度策略最佳实践

7.1 生产环境调度建议

7.2 常见调度问题排查

八、官方文档参考

调度相关

资源管理相关