一、概述
1.1 背景介紹
Pod調(diào)度是Kubernetes的核心機(jī)制之一,決定了Pod最終運(yùn)行在哪個(gè)節(jié)點(diǎn)上。默認(rèn)調(diào)度器kube-scheduler通過一系列預(yù)選(Filtering)和優(yōu)選(Scoring)算法完成調(diào)度決策,但默認(rèn)行為在生產(chǎn)環(huán)境中往往不夠用。
實(shí)際場(chǎng)景中經(jīng)常遇到的問題:數(shù)據(jù)庫Pod被調(diào)度到了沒有SSD的節(jié)點(diǎn)上,導(dǎo)致IO性能差;兩個(gè)高負(fù)載服務(wù)的Pod被調(diào)度到同一個(gè)節(jié)點(diǎn),互相搶資源;GPU節(jié)點(diǎn)上跑了一堆普通業(yè)務(wù)Pod,真正需要GPU的任務(wù)反而調(diào)度不上去。
這些問題都需要通過調(diào)度策略來解決。Kubernetes提供了nodeSelector、nodeAffinity、podAffinity/podAntiAffinity、taints/tolerations、topologySpreadConstraints等多種調(diào)度機(jī)制,本文逐一講解并給出生產(chǎn)環(huán)境的配置方案。
1.2 技術(shù)特點(diǎn)
多層調(diào)度控制:從簡單的nodeSelector到復(fù)雜的自定義調(diào)度器,提供不同粒度的調(diào)度控制能力
軟硬約束結(jié)合:requiredDuringScheduling是硬約束(不滿足就不調(diào)度),preferredDuringScheduling是軟約束(盡量滿足,不滿足也能調(diào)度)
拓?fù)涓兄?/strong>:topologySpreadConstraints支持按zone、node、rack等拓?fù)溆蚍稚od,實(shí)現(xiàn)跨故障域部署
搶占機(jī)制:PriorityClass支持高優(yōu)先級(jí)Pod搶占低優(yōu)先級(jí)Pod的資源
1.3 適用場(chǎng)景
場(chǎng)景一:將數(shù)據(jù)庫、緩存等IO密集型Pod調(diào)度到SSD節(jié)點(diǎn),計(jì)算密集型Pod調(diào)度到高CPU節(jié)點(diǎn)
場(chǎng)景二:同一服務(wù)的多個(gè)副本分散到不同節(jié)點(diǎn)/可用區(qū),避免單點(diǎn)故障導(dǎo)致服務(wù)全部不可用
場(chǎng)景三:GPU、FPGA等特殊硬件資源的獨(dú)占調(diào)度,防止普通Pod占用專用資源
場(chǎng)景四:多租戶集群中,不同團(tuán)隊(duì)的Pod隔離到各自的節(jié)點(diǎn)池
1.4 環(huán)境要求
| 組件 | 版本要求 | 說明 |
|---|---|---|
| Kubernetes | 1.24+ | topologySpreadConstraints在1.19 GA,PodSecurity在1.25 GA |
| kube-scheduler | 與集群版本一致 | 自定義調(diào)度器需要單獨(dú)部署 |
| 節(jié)點(diǎn)標(biāo)簽 | 提前規(guī)劃 | 調(diào)度策略依賴節(jié)點(diǎn)標(biāo)簽,需要統(tǒng)一標(biāo)簽規(guī)范 |
| metrics-server | 0.6+ | 資源感知調(diào)度需要metrics數(shù)據(jù) |
二、詳細(xì)步驟
2.1 準(zhǔn)備工作
2.1.1 節(jié)點(diǎn)標(biāo)簽規(guī)劃
調(diào)度策略的基礎(chǔ)是節(jié)點(diǎn)標(biāo)簽,先把標(biāo)簽體系規(guī)劃好:
# 查看現(xiàn)有節(jié)點(diǎn)標(biāo)簽
kubectl get nodes --show-labels

# 按硬件類型打標(biāo)簽
kubectl label node k8s-worker-01 disktype=ssd
kubectl label node k8s-worker-02 disktype=ssd
kubectl label node k8s-worker-03 disktype=hdd

# 按業(yè)務(wù)用途打標(biāo)簽
kubectl label node k8s-worker-01 workload-type=database
kubectl label node k8s-worker-02 workload-type=application
kubectl label node k8s-worker-03 workload-type=application

# 按可用區(qū)打標(biāo)簽(如果是多機(jī)房部署)
kubectl label node k8s-worker-01 topology.kubernetes.io/zone=zone-a
kubectl label node k8s-worker-02 topology.kubernetes.io/zone=zone-b
kubectl label node k8s-worker-03 topology.kubernetes.io/zone=zone-c

# GPU節(jié)點(diǎn)標(biāo)簽
kubectl label node k8s-gpu-01 accelerator=nvidia-tesla-v100
kubectl label node k8s-gpu-02 accelerator=nvidia-tesla-a100

# 驗(yàn)證標(biāo)簽
kubectl get nodes -L disktype,workload-type,topology.kubernetes.io/zone
注意:標(biāo)簽key的命名要有規(guī)范,建議統(tǒng)一加前綴(例如用公司域名形式 example.com/disktype,此處前綴僅為示意),避免和 kubernetes.io/、k8s.io/ 等系統(tǒng)保留前綴沖突,也便于后續(xù)按前綴批量管理。
2.1.2 理解調(diào)度流程
kube-scheduler的調(diào)度流程分為兩個(gè)階段:
預(yù)選(Filtering):過濾掉不滿足條件的節(jié)點(diǎn),比如資源不足、nodeSelector不匹配、taint不容忍等
優(yōu)選(Scoring):對(duì)通過預(yù)選的節(jié)點(diǎn)打分,選擇得分最高的節(jié)點(diǎn)
# 查看scheduler的調(diào)度日志(需要提高日志級(jí)別)
# 修改 /etc/kubernetes/manifests/kube-scheduler.yaml
# 在command中添加 --v=4
# 然后查看日志
kubectl logs -n kube-system kube-scheduler-k8s-master-01 --tail=50
2.1.3 調(diào)度策略優(yōu)先級(jí)
多種調(diào)度策略同時(shí)存在時(shí)的生效順序:
nodeName(最高優(yōu)先級(jí)):直接指定節(jié)點(diǎn)名,跳過調(diào)度器
taints/tolerations:節(jié)點(diǎn)污點(diǎn)過濾,不容忍的Pod直接排除
nodeSelector:簡單的標(biāo)簽匹配過濾
nodeAffinity:更靈活的節(jié)點(diǎn)親和性規(guī)則
podAffinity/podAntiAffinity:Pod間的親和/反親和
topologySpreadConstraints:拓?fù)浞稚⒓s束
資源請(qǐng)求:節(jié)點(diǎn)剩余資源是否滿足Pod的requests
2.2 核心配置
2.2.1 nodeSelector(最簡單的調(diào)度約束)
nodeSelector是最基礎(chǔ)的調(diào)度方式,通過標(biāo)簽鍵值對(duì)匹配節(jié)點(diǎn):
# 文件:nginx-nodeselector.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ssd
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-ssd
  template:
    metadata:
      labels:
        app: nginx-ssd
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
kubectl apply -f nginx-nodeselector.yaml

# 驗(yàn)證Pod只調(diào)度到了ssd節(jié)點(diǎn)
kubectl get pods -l app=nginx-ssd -o wide
注意:nodeSelector是硬約束,如果沒有節(jié)點(diǎn)匹配標(biāo)簽,Pod會(huì)一直Pending。生產(chǎn)環(huán)境建議配合nodeAffinity的軟約束使用。
2.2.2 nodeAffinity(節(jié)點(diǎn)親和性)
nodeAffinity比nodeSelector更靈活,支持多種操作符和軟硬約束:
# 文件:app-node-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-affinity
  namespace: default
spec:
  replicas: 6
  selector:
    matchLabels:
      app: app-affinity
  template:
    metadata:
      labels:
        app: app-affinity
    spec:
      affinity:
        nodeAffinity:
          # 硬約束:必須調(diào)度到zone-a或zone-b
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - zone-a
                - zone-b
          # 軟約束:優(yōu)先調(diào)度到ssd節(jié)點(diǎn),權(quán)重1-100
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          - weight: 20
            preference:
              matchExpressions:
              - key: workload-type
                operator: In
                values:
                - application
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
操作符說明:
In:標(biāo)簽值在列表中
NotIn:標(biāo)簽值不在列表中
Exists:標(biāo)簽存在(不關(guān)心值)
DoesNotExist:標(biāo)簽不存在
Gt:標(biāo)簽值大于指定值(僅限數(shù)字)
Lt:標(biāo)簽值小于指定值(僅限數(shù)字)
注意:requiredDuringSchedulingIgnoredDuringExecution中的IgnoredDuringExecution表示Pod已經(jīng)運(yùn)行后,即使節(jié)點(diǎn)標(biāo)簽變了也不會(huì)驅(qū)逐Pod。Kubernetes計(jì)劃實(shí)現(xiàn)RequiredDuringExecution但目前還沒有。
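可以用下面的小實(shí)驗(yàn)直觀驗(yàn)證IgnoredDuringExecution的行為(節(jié)點(diǎn)名和標(biāo)簽沿用前文示例,僅建議在測(cè)試環(huán)境操作,驗(yàn)證完記得恢復(fù)標(biāo)簽):

# 假設(shè)app-with-affinity的某個(gè)Pod已運(yùn)行在k8s-worker-01上
kubectl get pods -l app=app-affinity -o wide

# 修改zone標(biāo)簽,使該節(jié)點(diǎn)不再滿足required規(guī)則
kubectl label node k8s-worker-01 topology.kubernetes.io/zone=zone-x --overwrite

# 已運(yùn)行的Pod不會(huì)被驅(qū)逐,仍保持Running;只有新建/重建的Pod才會(huì)重新按規(guī)則調(diào)度
kubectl get pods -l app=app-affinity -o wide

# 實(shí)驗(yàn)結(jié)束后恢復(fù)標(biāo)簽
kubectl label node k8s-worker-01 topology.kubernetes.io/zone=zone-a --overwrite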
2.2.3 podAffinity和podAntiAffinity(Pod間親和/反親和)
控制Pod之間的調(diào)度關(guān)系,典型場(chǎng)景:Web應(yīng)用和緩存部署在同一節(jié)點(diǎn)減少網(wǎng)絡(luò)延遲,同一服務(wù)的多個(gè)副本分散到不同節(jié)點(diǎn)。
# 文件:web-cache-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        # Pod親和:和redis-cache部署在同一節(jié)點(diǎn)
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - redis-cache
              topologyKey: kubernetes.io/hostname
        # Pod反親和:同一服務(wù)的副本分散到不同節(jié)點(diǎn)
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-frontend
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
說明:
topologyKey: kubernetes.io/hostname表示以節(jié)點(diǎn)為拓?fù)溆?,同一?jié)點(diǎn)上的Pod視為同一拓?fù)溆?topologyKey: topology.kubernetes.io/zone表示以可用區(qū)為拓?fù)溆?
podAntiAffinity的硬約束會(huì)限制副本數(shù)不能超過節(jié)點(diǎn)數(shù),3個(gè)副本至少需要3個(gè)節(jié)點(diǎn)
警告:podAffinity/podAntiAffinity的計(jì)算復(fù)雜度是O(N^2),N是集群中的Pod數(shù)量。在Pod數(shù)量超過5000的大集群中,大量使用podAffinity會(huì)導(dǎo)致調(diào)度延遲從毫秒級(jí)上升到秒級(jí)。
2.2.4 Taints和Tolerations(污點(diǎn)和容忍)
Taints從節(jié)點(diǎn)角度排斥Pod,Tolerations從Pod角度容忍污點(diǎn)。兩者配合實(shí)現(xiàn)節(jié)點(diǎn)專用化。
# 給GPU節(jié)點(diǎn)添加污點(diǎn),只允許GPU任務(wù)調(diào)度
kubectl taint nodes k8s-gpu-01 gpu=true:NoSchedule
kubectl taint nodes k8s-gpu-02 gpu=true:NoSchedule

# 給維護(hù)中的節(jié)點(diǎn)添加NoExecute污點(diǎn),驅(qū)逐現(xiàn)有Pod
kubectl taint nodes k8s-worker-03 maintenance=true:NoExecute

# 查看節(jié)點(diǎn)污點(diǎn)
kubectl describe node k8s-gpu-01 | grep -A 5 Taints

# 刪除污點(diǎn)
kubectl taint nodes k8s-worker-03 maintenance=true:NoExecute-
Pod中配置Tolerations:
# 文件:gpu-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
  namespace: ml-training
spec:
  template:
    spec:
      tolerations:
      # 容忍gpu污點(diǎn)
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
      - name: training
        image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            cpu: "4"
            memory: "16Gi"
      restartPolicy: Never
Taint Effect說明:
NoSchedule:新Pod不會(huì)調(diào)度到該節(jié)點(diǎn),已有Pod不受影響
PreferNoSchedule:盡量不調(diào)度,但資源不足時(shí)仍可調(diào)度
NoExecute:新Pod不調(diào)度,已有Pod如果不容忍會(huì)被驅(qū)逐??梢栽O(shè)置tolerationSeconds指定驅(qū)逐前的等待時(shí)間
# 容忍N(yùn)oExecute污點(diǎn),但最多等待300秒后被驅(qū)逐
tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 300
2.2.5 topologySpreadConstraints(拓?fù)浞稚⒓s束)
1.19版本GA的功能,比podAntiAffinity更精細(xì)地控制Pod在拓?fù)溆蜷g的分布:
# 文件:app-topology-spread.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-spread
  namespace: default
spec:
  replicas: 9
  selector:
    matchLabels:
      app: app-spread
  template:
    metadata:
      labels:
        app: app-spread
    spec:
      topologySpreadConstraints:
      # 跨可用區(qū)均勻分布,最大偏差1
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: app-spread
      # 跨節(jié)點(diǎn)均勻分布,最大偏差1,軟約束
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: app-spread
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
參數(shù)說明:
maxSkew:拓?fù)溆蜷gPod數(shù)量的最大差值。設(shè)為1表示任意兩個(gè)域的Pod數(shù)量差不超過1
topologyKey:拓?fù)溆虻臉?biāo)簽key
whenUnsatisfiable:不滿足約束時(shí)的行為,DoNotSchedule(硬約束)或ScheduleAnyway(軟約束)
9個(gè)副本在3個(gè)zone中的分布結(jié)果:zone-a=3, zone-b=3, zone-c=3。如果zone-c只有1個(gè)節(jié)點(diǎn)且資源不足,DoNotSchedule會(huì)導(dǎo)致部分Pod Pending,ScheduleAnyway則會(huì)盡量均勻但允許偏差。
2.2.6 PriorityClass(優(yōu)先級(jí)和搶占)
高優(yōu)先級(jí)Pod可以搶占低優(yōu)先級(jí)Pod的資源:
# 定義優(yōu)先級(jí)類
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "生產(chǎn)核心服務(wù),可搶占低優(yōu)先級(jí)Pod"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: normal-production
value: 500000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "普通生產(chǎn)服務(wù)"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-job
value: 100000
globalDefault: false
preemptionPolicy: Never
description: "批處理任務(wù),不搶占其他Pod"
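應(yīng)用后可以先確認(rèn)優(yōu)先級(jí)類是否創(chuàng)建成功(文件名為示意):

kubectl apply -f priorityclass.yaml
kubectl get priorityclass
# 輸出中VALUE列即優(yōu)先級(jí)數(shù)值;globalDefault為true的類會(huì)作為未指定priorityClassName時(shí)的默認(rèn)優(yōu)先級(jí)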
在Pod中引用:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: core-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: core-api
  template:
    metadata:
      labels:
        app: core-api
    spec:
      priorityClassName: critical-production
      containers:
      - name: api
        image: myapp:v1.0
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
警告:preemptionPolicy: PreemptLowerPriority會(huì)驅(qū)逐低優(yōu)先級(jí)Pod來騰出資源,被驅(qū)逐的Pod會(huì)收到SIGTERM信號(hào)。確保應(yīng)用能正確處理優(yōu)雅關(guān)閉,否則會(huì)丟數(shù)據(jù)。生產(chǎn)環(huán)境建議批處理任務(wù)設(shè)置preemptionPolicy: Never。
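下面是一個(gè)優(yōu)雅關(guān)閉配置的簡單示意(Pod模板片段;terminationGracePeriodSeconds和preStop中的sleep時(shí)長均為假設(shè)值,需按業(yè)務(wù)實(shí)際的退出耗時(shí)調(diào)整):

# 在可能被搶占/驅(qū)逐的工作負(fù)載Pod模板中,預(yù)留足夠的優(yōu)雅退出時(shí)間
spec:
  terminationGracePeriodSeconds: 60   # 收到SIGTERM后最多等待60秒,超時(shí)才會(huì)被SIGKILL
  containers:
  - name: app
    image: myapp:v1.0
    lifecycle:
      preStop:
        exec:
          # 先等待負(fù)載均衡摘除流量,再讓進(jìn)程退出;時(shí)長按實(shí)際情況調(diào)整
          command: ["sh", "-c", "sleep 10"]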
2.3 啟動(dòng)和驗(yàn)證
2.3.1 驗(yàn)證調(diào)度結(jié)果
# 查看Pod調(diào)度到了哪個(gè)節(jié)點(diǎn)
kubectl get pods -o wide -l app=app-spread

# 查看Pod的調(diào)度事件
kubectl describe pod <pod-name> | grep -A 10 Events

# 查看調(diào)度失敗的原因
kubectl get events --field-selector reason=FailedScheduling -A

# 查看節(jié)點(diǎn)資源分配情況
kubectl describe node k8s-worker-01 | grep -A 20 "Allocated resources"
2.3.2 調(diào)度模擬測(cè)試
# 使用kubectl創(chuàng)建一個(gè)dry-run的Pod,查看是否能調(diào)度成功
kubectl run test-schedule --image=nginx:1.24 --dry-run=server -o yaml \
  --overrides='{
    "spec": {
      "nodeSelector": {"disktype": "ssd"},
      "containers": [{"name": "test", "image": "nginx:1.24", "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}}]
    }
  }'

# 查看各節(jié)點(diǎn)的可分配資源
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU_ALLOC:.status.allocatable.cpu,MEM_ALLOC:.status.allocatable.memory,PODS_ALLOC:.status.allocatable.pods
2.3.3 驗(yàn)證拓?fù)浞稚?/p>
# 先查看Pod分布在哪些節(jié)點(diǎn)上
kubectl get pods -l app=app-spread -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

# zone信息在節(jié)點(diǎn)標(biāo)簽上,需要再通過節(jié)點(diǎn)標(biāo)簽查看
for node in $(kubectl get pods -l app=app-spread -o jsonpath='{.items[*].spec.nodeName}'); do
  zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  echo "Node: $node, Zone: $zone"
done
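如果只想快速統(tǒng)計(jì)每個(gè)zone的Pod數(shù)量,可以在上面循環(huán)的基礎(chǔ)上加一層計(jì)數(shù)(僅為示意,依賴前文的zone標(biāo)簽約定):

# 統(tǒng)計(jì)app-spread在各zone的Pod數(shù)量
for node in $(kubectl get pods -l app=app-spread -o jsonpath='{.items[*].spec.nodeName}'); do
  kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
done | sort | uniq -c
# 期望輸出類似每個(gè)zone各3個(gè),與maxSkew: 1的約束一致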
三、示例代碼和配置
3.1 完整配置示例
3.1.1 多租戶節(jié)點(diǎn)池隔離方案
生產(chǎn)環(huán)境中不同團(tuán)隊(duì)共用一個(gè)集群,通過taint+nodeAffinity實(shí)現(xiàn)節(jié)點(diǎn)池隔離:
# 文件:namespace-resource-setup.yaml
# 第一步:創(chuàng)建團(tuán)隊(duì)namespace
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
  labels:
    team: backend
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-data
  labels:
    team: data
---
# 第二步:為每個(gè)團(tuán)隊(duì)設(shè)置ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-quota
  namespace: team-data
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "200"
節(jié)點(diǎn)打標(biāo)簽和污點(diǎn):
# backend團(tuán)隊(duì)節(jié)點(diǎn)池
kubectl label node k8s-worker-{01..05} node-pool=backend
kubectl taint nodes k8s-worker-{01..05} node-pool=backend:NoSchedule
# data團(tuán)隊(duì)節(jié)點(diǎn)池
kubectl label node k8s-worker-{06..10} node-pool=data
kubectl taint nodes k8s-worker-{06..10} node-pool=data:NoSchedule
# 公共節(jié)點(diǎn)池(不加污點(diǎn),所有Pod都能調(diào)度)
kubectl label node k8s-worker-{11..15} node-pool=shared
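打完標(biāo)簽和污點(diǎn)后,建議先確認(rèn)節(jié)點(diǎn)池劃分符合預(yù)期,再下發(fā)下面的Deployment模板:

# 按node-pool標(biāo)簽查看節(jié)點(diǎn)池劃分
kubectl get nodes -L node-pool

# 抽查某個(gè)節(jié)點(diǎn)的污點(diǎn)是否已生效
kubectl describe node k8s-worker-01 | grep -A 3 Taints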
團(tuán)隊(duì)Deployment模板:
# 文件:backend-app-template.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
  namespace: team-backend
spec:
  replicas: 5
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      tolerations:
      - key: "node-pool"
        operator: "Equal"
        value: "backend"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-pool
                operator: In
                values:
                - backend
                - shared
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - backend-api
              topologyKey: kubernetes.io/hostname
      containers:
      - name: api
        image: backend-api:v2.1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 1Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
3.1.2 調(diào)度策略自動(dòng)注入腳本
通過Kyverno策略自動(dòng)為特定namespace的Pod注入調(diào)度規(guī)則,避免每個(gè)Deployment都手動(dòng)配置:
# 文件:kyverno-scheduling-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-node-affinity-backend
spec:
  rules:
  - name: add-backend-scheduling
    match:
      any:
      - resources:
          kinds:
          - Pod
          namespaces:
          - team-backend
    mutate:
      patchStrategicMerge:
        spec:
          tolerations:
          - key: "node-pool"
            operator: "Equal"
            value: "backend"
            effect: "NoSchedule"
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-pool
                    operator: In
                    values:
                    - backend
                    - shared
注意:Kyverno需要單獨(dú)安裝(helm install kyverno kyverno/kyverno -n kyverno --create-namespace)。這種方式比在每個(gè)Deployment里寫調(diào)度規(guī)則更易維護(hù),團(tuán)隊(duì)只需要關(guān)注業(yè)務(wù)配置,調(diào)度策略由平臺(tái)團(tuán)隊(duì)統(tǒng)一管理。
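可以用一個(gè)臨時(shí)Pod驗(yàn)證策略是否生效(Pod名僅為示意):

# 在team-backend中創(chuàng)建一個(gè)未聲明任何調(diào)度規(guī)則的測(cè)試Pod
kubectl run sched-test --image=nginx:1.24 -n team-backend --restart=Never

# 查看準(zhǔn)入階段是否已自動(dòng)注入tolerations和nodeAffinity
kubectl get pod sched-test -n team-backend -o yaml | grep -A 6 tolerations
kubectl get pod sched-test -n team-backend -o yaml | grep -A 12 nodeAffinity

# 驗(yàn)證完清理
kubectl delete pod sched-test -n team-backend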
3.2 實(shí)際應(yīng)用案例
案例一:數(shù)據(jù)庫Pod的調(diào)度策略
場(chǎng)景描述:MySQL主從集群部署在K8s中,主庫需要SSD+高內(nèi)存節(jié)點(diǎn),從庫可以用普通節(jié)點(diǎn)。主從Pod不能在同一節(jié)點(diǎn)上,避免節(jié)點(diǎn)故障導(dǎo)致主從同時(shí)不可用。
實(shí)現(xiàn)代碼:
# 文件:mysql-master-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql-master
  namespace: database
spec:
  serviceName: mysql-master
  replicas: 1
  selector:
    matchLabels:
      app: mysql
      role: master
  template:
    metadata:
      labels:
        app: mysql
        role: master
    spec:
      priorityClassName: critical-production
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
              - key: workload-type
                operator: In
                values:
                - database
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mysql
            topologyKey: kubernetes.io/hostname
      containers:
      - name: mysql
        image: mysql:8.0.35
        ports:
        - containerPort: 3306
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: root-password
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "4"
            memory: 8Gi
        volumeMounts:
        - name: mysql-data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: local-ssd
      resources:
        requests:
          storage: 100Gi
---
# MySQL從庫
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql-slave
  namespace: database
spec:
  serviceName: mysql-slave
  replicas: 2
  selector:
    matchLabels:
      app: mysql
      role: slave
  template:
    metadata:
      labels:
        app: mysql
        role: slave
    spec:
      priorityClassName: normal-production
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mysql
            topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: mysql
            role: slave
      containers:
      - name: mysql
        image: mysql:8.0.35
        ports:
        - containerPort: 3306
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 4Gi
        volumeMounts:
        - name: mysql-data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard
      resources:
        requests:
          storage: 100Gi
運(yùn)行結(jié)果:
NAME             READY   STATUS    NODE            ZONE
mysql-master-0   1/1     Running   k8s-worker-01   zone-a   (SSD+database節(jié)點(diǎn))
mysql-slave-0    1/1     Running   k8s-worker-02   zone-b   (不同節(jié)點(diǎn)不同zone)
mysql-slave-1    1/1     Running   k8s-worker-03   zone-c   (不同節(jié)點(diǎn)不同zone)
案例二:混合調(diào)度策略實(shí)現(xiàn)灰度發(fā)布
場(chǎng)景描述:灰度發(fā)布時(shí),新版本Pod先調(diào)度到特定的灰度節(jié)點(diǎn),驗(yàn)證通過后再擴(kuò)展到所有節(jié)點(diǎn)。通過標(biāo)簽和調(diào)度策略控制灰度范圍。
實(shí)現(xiàn)步驟:
給灰度節(jié)點(diǎn)打標(biāo)簽:
kubectl label node k8s-worker-01 canary=true
灰度版本Deployment:
# 文件:app-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: canary
                operator: In
                values:
                - "true"
      containers:
      - name: myapp
        image: myapp:v2.0.0-rc1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
灰度驗(yàn)證通過后,全量發(fā)布:
# 移除灰度節(jié)點(diǎn)約束,更新穩(wěn)定版Deployment的鏡像
kubectl set image deployment/myapp-stable myapp=myapp:v2.0.0 -n production

# 縮容灰度Deployment
kubectl scale deployment/myapp-canary --replicas=0 -n production

# 清理灰度標(biāo)簽
kubectl label node k8s-worker-01 canary-
四、最佳實(shí)踐和注意事項(xiàng)
4.1 最佳實(shí)踐
4.1.1 性能優(yōu)化
減少podAffinity的使用范圍:podAffinity/podAntiAffinity的調(diào)度計(jì)算復(fù)雜度高,在500+節(jié)點(diǎn)集群中,一個(gè)帶podAffinity的Pod調(diào)度耗時(shí)從5ms增加到200ms。能用topologySpreadConstraints替代的場(chǎng)景優(yōu)先用topologySpreadConstraints。
# 不推薦:用podAntiAffinity實(shí)現(xiàn)分散
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: myapp
    topologyKey: kubernetes.io/hostname

# 推薦:用topologySpreadConstraints替代
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: myapp
合理設(shè)置resource requests:調(diào)度器根據(jù)requests而非limits做調(diào)度決策。requests設(shè)太高導(dǎo)致節(jié)點(diǎn)利用率低(實(shí)測(cè)平均CPU利用率只有15%),設(shè)太低導(dǎo)致節(jié)點(diǎn)超賣嚴(yán)重,Pod被OOMKill。建議requests設(shè)為實(shí)際使用量的P95值。
# 查看Pod實(shí)際資源使用,作為requests設(shè)置參考
kubectl top pods -n production --sort-by=cpu
kubectl top pods -n production --sort-by=memory
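kubectl top只反映瞬時(shí)值;如果集群接入了Prometheus,可以用類似下面的查詢估算一段時(shí)間內(nèi)CPU使用量的P95(這里以container_cpu_usage_seconds_total為例,實(shí)際可用的指標(biāo)取決于采集配置,僅為示意):

# 過去7天內(nèi),production命名空間各容器CPU使用量(核)的P95,作為requests參考值
quantile_over_time(0.95,
  sum by (namespace, pod, container) (
    rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])
  )[7d:5m]
)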
使用Descheduler重平衡:節(jié)點(diǎn)擴(kuò)容或Pod漂移后,集群負(fù)載可能不均衡。Descheduler可以驅(qū)逐不符合當(dāng)前調(diào)度策略的Pod,讓調(diào)度器重新調(diào)度。
# 安裝Descheduler(以CronJob方式每5分鐘運(yùn)行一次)
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm install descheduler descheduler/descheduler -n kube-system --set schedule="*/5 * * * *"
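Descheduler具體驅(qū)逐哪些Pod由策略文件決定(通常以ConfigMap形式掛載)。下面是LowNodeUtilization策略的一個(gè)簡化示意,采用v1alpha1策略格式,閾值均為假設(shè)值;新版本Descheduler使用v1alpha2的profiles格式,字段有所不同,請(qǐng)以對(duì)應(yīng)版本文檔為準(zhǔn):

# DeschedulerPolicy示意:把過載節(jié)點(diǎn)上的Pod驅(qū)逐,讓調(diào)度器重新分配到低利用率節(jié)點(diǎn)
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # 低于thresholds的節(jié)點(diǎn)視為低利用率,高于targetThresholds的節(jié)點(diǎn)視為過載
        thresholds:
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:
          cpu: 60
          memory: 60
          pods: 60
  "RemoveDuplicates":
    enabled: true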
4.1.2 安全加固
限制nodeName直接指定:nodeName會(huì)跳過調(diào)度器的所有檢查(包括資源檢查),生產(chǎn)環(huán)境通過RBAC限制普通用戶使用nodeName字段。
# OPA/Gatekeeper策略:禁止使用nodeName
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDenyNodeName
metadata:
  name: deny-nodename
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    excludedNamespaces: ["kube-system"]
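K8sDenyNodeName不是Gatekeeper內(nèi)置的約束類型,需要先創(chuàng)建對(duì)應(yīng)的ConstraintTemplate。下面是一個(gè)最簡化的示意實(shí)現(xiàn)(Rego只檢查創(chuàng)建Pod時(shí)是否直接寫了spec.nodeName,命名和字段均為假設(shè),僅供參考):

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenynodename
spec:
  crd:
    spec:
      names:
        kind: K8sDenyNodeName
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sdenynodename

      # 如果提交的Pod直接指定了spec.nodeName,則產(chǎn)生violation,拒絕創(chuàng)建
      violation[{"msg": msg}] {
        nodeName := input.review.object.spec.nodeName
        nodeName != ""
        msg := sprintf("directly setting spec.nodeName (%v) is not allowed", [nodeName])
      }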
PriorityClass權(quán)限控制:高優(yōu)先級(jí)PriorityClass的創(chuàng)建和使用需要限制,防止普通用戶創(chuàng)建高優(yōu)先級(jí)Pod搶占核心服務(wù)資源。
# RBAC:只允許admin使用critical-production優(yōu)先級(jí)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: use-critical-priority
rules:
- apiGroups: ["scheduling.k8s.io"]
  resources: ["priorityclasses"]
  resourceNames: ["critical-production"]
  verbs: ["get", "list"]
節(jié)點(diǎn)污點(diǎn)防篡改:關(guān)鍵節(jié)點(diǎn)(如GPU節(jié)點(diǎn))的污點(diǎn)被誤刪會(huì)導(dǎo)致普通Pod涌入。通過準(zhǔn)入控制webhook攔截對(duì)特定節(jié)點(diǎn)taint的修改操作。
4.1.3 高可用配置
HA方案一:核心服務(wù)至少3副本,配合podAntiAffinity硬約束分散到不同節(jié)點(diǎn),再用topologySpreadConstraints分散到不同可用區(qū)
HA方案二:使用PodDisruptionBudget(PDB)限制同時(shí)不可用的Pod數(shù)量,防止節(jié)點(diǎn)維護(hù)時(shí)服務(wù)中斷
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
備份策略:調(diào)度策略配置(PriorityClass、節(jié)點(diǎn)標(biāo)簽、污點(diǎn))納入GitOps管理,用ArgoCD或FluxCD同步
4.2 注意事項(xiàng)
4.2.1 配置注意事項(xiàng)
警告:調(diào)度策略配置錯(cuò)誤可能導(dǎo)致Pod無法調(diào)度或調(diào)度到錯(cuò)誤節(jié)點(diǎn),修改前在測(cè)試環(huán)境驗(yàn)證。
注意nodeAffinity的requiredDuringSchedulingIgnoredDuringExecution中,多個(gè)nodeSelectorTerms之間是OR關(guān)系,同一個(gè)nodeSelectorTerm中的多個(gè)matchExpressions之間是AND關(guān)系。搞混了會(huì)導(dǎo)致調(diào)度結(jié)果不符合預(yù)期。
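下面用一個(gè)小片段直觀對(duì)比兩種寫法的差別(標(biāo)簽沿用前文示例):

# 寫法一:兩個(gè)nodeSelectorTerms,彼此是OR關(guān)系,節(jié)點(diǎn)有ssd盤或?qū)儆赿atabase用途,滿足其一即可
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: disktype
      operator: In
      values: ["ssd"]
  - matchExpressions:
    - key: workload-type
      operator: In
      values: ["database"]

# 寫法二:同一個(gè)nodeSelectorTerm里兩個(gè)matchExpressions,是AND關(guān)系,必須同時(shí)滿足
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: disktype
      operator: In
      values: ["ssd"]
    - key: workload-type
      operator: In
      values: ["database"]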
注意podAntiAffinity硬約束會(huì)限制副本數(shù)上限。如果topologyKey是kubernetes.io/hostname,副本數(shù)不能超過可用節(jié)點(diǎn)數(shù),否則多出來的Pod永遠(yuǎn)Pending。
注意topologySpreadConstraints的labelSelector必須和Pod自身的標(biāo)簽匹配,否則約束不生效。這個(gè)錯(cuò)誤很隱蔽,Pod能正常調(diào)度但分布不均勻。
4.2.2 常見錯(cuò)誤
| 錯(cuò)誤現(xiàn)象 | 原因分析 | 解決方案 |
|---|---|---|
| Pod一直Pending,事件顯示0/6 nodes are available | nodeSelector或nodeAffinity沒有匹配的節(jié)點(diǎn) | 檢查節(jié)點(diǎn)標(biāo)簽是否正確:kubectl get nodes --show-labels |
| Pod調(diào)度成功但分布不均勻 | topologySpreadConstraints的labelSelector寫錯(cuò) | 確認(rèn)labelSelector和Pod的labels完全一致 |
| 高優(yōu)先級(jí)Pod搶占后,被搶占的Pod無法重新調(diào)度 | 集群資源不足,被搶占的Pod也找不到合適節(jié)點(diǎn) | 擴(kuò)容節(jié)點(diǎn)或降低resource requests |
| taint添加后已有Pod沒被驅(qū)逐 | 使用了NoSchedule而非NoExecute | NoSchedule只影響新Pod,要驅(qū)逐已有Pod用NoExecute |
| 節(jié)點(diǎn)維護(hù)drain失敗 | Pod有PDB限制,minAvailable不滿足 | 先擴(kuò)容副本數(shù),或臨時(shí)調(diào)整PDB的minAvailable |
4.2.3 兼容性問題
版本兼容:topologySpreadConstraints在1.18 Beta、1.19 GA;minDomains字段在1.25 Beta,使用前確認(rèn)集群版本
平臺(tái)兼容:云廠商托管K8s通常自動(dòng)設(shè)置topology.kubernetes.io/zone標(biāo)簽,自建集群需要手動(dòng)設(shè)置
組件依賴:Descheduler版本需要和K8s版本匹配,0.27.x支持K8s 1.25-1.28
五、故障排查和監(jiān)控
5.1 故障排查
5.1.1 日志查看
# 查看kube-scheduler日志
kubectl logs -n kube-system -l component=kube-scheduler --tail=100

# 查看調(diào)度事件
kubectl get events -A --field-selector reason=FailedScheduling --sort-by='.lastTimestamp'

# 查看特定Pod的調(diào)度事件
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 Events

# 查看節(jié)點(diǎn)資源分配詳情
kubectl describe node <node-name> | grep -A 30 "Allocated resources"
5.1.2 常見問題排查
問題一:Pod Pending,報(bào)Insufficient cpu
# 診斷命令
kubectl describe pod <pod-name> | grep -A 5 "Events"
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU_ALLOC:.status.allocatable.cpu,MEM_ALLOC:.status.allocatable.memory

# 查看各節(jié)點(diǎn)已分配資源
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $node ==="
  kubectl describe node "$node" | grep -A 5 "Allocated resources"
done
解決方案:
檢查Pod的resource requests是否設(shè)置過高
檢查節(jié)點(diǎn)是否有足夠的可分配資源(allocatable - 已分配)
考慮擴(kuò)容節(jié)點(diǎn)或優(yōu)化現(xiàn)有Pod的資源配置
問題二:調(diào)度策略不生效,Pod沒有按預(yù)期分布
# 診斷命令
kubectl get pod <pod-name> -o yaml | grep -A 50 "affinity"
kubectl get pod <pod-name> -o yaml | grep -A 20 "topologySpreadConstraints"

# 檢查節(jié)點(diǎn)標(biāo)簽是否正確
kubectl get nodes --show-labels | grep <label-key>
解決方案:
確認(rèn)YAML縮進(jìn)正確,affinity字段層級(jí)關(guān)系容易寫錯(cuò)
確認(rèn)labelSelector和Pod標(biāo)簽一致
用kubectl apply --dry-run=server驗(yàn)證YAML語法
問題三:節(jié)點(diǎn)drain卡住不動(dòng)
癥狀:kubectl drain命令長時(shí)間無響應(yīng)
排查:
# 查看哪些Pod阻止了drain
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --dry-run=client

# 檢查PDB
kubectl get pdb -A
解決:
DaemonSet的Pod加--ignore-daemonsets跳過
使用emptyDir的Pod加--delete-emptydir-data
PDB限制導(dǎo)致的,先擴(kuò)容副本再drain
有Pod設(shè)置了terminationGracePeriodSeconds很長,加--timeout=300s限制等待時(shí)間
5.1.3 調(diào)試模式
# 提高scheduler日志級(jí)別
# 編輯 /etc/kubernetes/manifests/kube-scheduler.yaml
# 在command中添加 --v=10(最詳細(xì),僅調(diào)試用)

# 查看scheduler的調(diào)度決策過程
kubectl logs -n kube-system kube-scheduler-k8s-master-01 | grep "pod-name"

# 使用kube-scheduler-simulator模擬調(diào)度(需要單獨(dú)安裝)
# https://github.com/kubernetes-sigs/kube-scheduler-simulator

# 查看metrics-server提供的節(jié)點(diǎn)資源指標(biāo)(注意這不是調(diào)度器自身的指標(biāo))
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
5.2 性能監(jiān)控
5.2.1 關(guān)鍵指標(biāo)監(jiān)控
# kube-scheduler的指標(biāo)由其自身的metrics端點(diǎn)暴露(kubeadm部署默認(rèn)監(jiān)聽10259端口,需要認(rèn)證),
# 通常由Prometheus采集后查詢,重點(diǎn)關(guān)注:
# 調(diào)度延遲:scheduler_scheduling_algorithm_duration_seconds
# 調(diào)度隊(duì)列長度:scheduler_pending_pods
# 搶占次數(shù):scheduler_preemption_victims

# 節(jié)點(diǎn)資源使用率
kubectl top nodes
5.2.2 監(jiān)控指標(biāo)說明
| 指標(biāo)名稱 | 正常范圍 | 告警閾值 | 說明 |
|---|---|---|---|
| 調(diào)度延遲(P99) | <100ms | >500ms | 超過500ms說明調(diào)度器過載或策略過于復(fù)雜 |
| Pending Pod數(shù)量 | 0 | >10持續(xù)5分鐘 | 持續(xù)有Pod無法調(diào)度需要排查 |
| 調(diào)度失敗率 | <1% | >5% | 高失敗率說明資源不足或策略配置有問題 |
| 節(jié)點(diǎn)CPU分配率 | 40%-70% | >85% | 分配率過高會(huì)導(dǎo)致新Pod無法調(diào)度 |
| 節(jié)點(diǎn)內(nèi)存分配率 | 50%-80% | >90% | 接近100%時(shí)需要擴(kuò)容 |
| 搶占事件數(shù) | 0 | >5次/小時(shí) | 頻繁搶占說明資源規(guī)劃不合理 |
5.2.3 Prometheus監(jiān)控規(guī)則
# 文件:scheduler-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-alerts
  namespace: monitoring
spec:
  groups:
  - name: kube-scheduler
    rules:
    - alert: SchedulerHighLatency
      expr: |
        histogram_quantile(0.99,
          sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le)
        ) > 0.5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Scheduler P99 latency exceeds 500ms"
    - alert: PodsPendingTooLong
      expr: |
        sum(scheduler_pending_pods{queue="active"}) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "More than 10 pods pending for over 5 minutes"
    - alert: SchedulerUnhealthy
      expr: absent(up{job="kube-scheduler"} == 1)
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "kube-scheduler is not running"
    - alert: NodeHighAllocation
      expr: |
        sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
          / sum by (node) (kube_node_status_allocatable{resource="cpu"}) > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.node }} CPU allocation exceeds 85%"
    - alert: FrequentPreemption
      expr: |
        increase(scheduler_preemption_victims[1h]) > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "More than 5 preemption events in the last hour"
5.3 備份與恢復(fù)
5.3.1 備份策略
#!/bin/bash
# 調(diào)度策略配置備份腳本
# 文件:/opt/scripts/scheduling-config-backup.sh
set -euo pipefail

BACKUP_DIR="/data/scheduling-backup/$(date +%Y%m%d)"
mkdir -p "${BACKUP_DIR}"

# 備份PriorityClass
kubectl get priorityclass -o yaml > "${BACKUP_DIR}/priorityclasses.yaml"

# 備份PDB
kubectl get pdb -A -o yaml > "${BACKUP_DIR}/pdbs.yaml"

# 備份節(jié)點(diǎn)標(biāo)簽和污點(diǎn)
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get node "$node" -o jsonpath='{.metadata.labels}' > "${BACKUP_DIR}/${node}-labels.json"
  kubectl get node "$node" -o jsonpath='{.spec.taints}' > "${BACKUP_DIR}/${node}-taints.json"
done

# 備份Kyverno策略(如果使用)
kubectl get clusterpolicy -o yaml > "${BACKUP_DIR}/kyverno-policies.yaml" 2>/dev/null || true

echo "[$(date)] Scheduling config backup completed: ${BACKUP_DIR}"
5.3.2 恢復(fù)流程
停止服務(wù):暫停業(yè)務(wù)部署,避免恢復(fù)過程中的調(diào)度沖突
恢復(fù)數(shù)據(jù):kubectl apply -f ${BACKUP_DIR}/priorityclasses.yaml
驗(yàn)證完整性:kubectl get priorityclass確認(rèn)PriorityClass恢復(fù)
恢復(fù)節(jié)點(diǎn)配置:逐節(jié)點(diǎn)恢復(fù)標(biāo)簽和污點(diǎn)(可參考本節(jié)末尾的示例腳本)
驗(yàn)證調(diào)度:創(chuàng)建測(cè)試Pod驗(yàn)證調(diào)度策略是否生效
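下面是配合上述備份格式恢復(fù)污點(diǎn)的一個(gè)簡單示意腳本(假設(shè)備份目錄結(jié)構(gòu)與前文備份腳本一致,標(biāo)簽可用同樣思路恢復(fù);恢復(fù)前請(qǐng)先在測(cè)試環(huán)境驗(yàn)證):

#!/bin/bash
# 從備份目錄恢復(fù)各節(jié)點(diǎn)的污點(diǎn)
set -euo pipefail
BACKUP_DIR="/data/scheduling-backup/20240101"   # 替換為實(shí)際的備份日期目錄

for taint_file in "${BACKUP_DIR}"/*-taints.json; do
  node=$(basename "$taint_file" -taints.json)
  taints=$(cat "$taint_file")
  # 備份時(shí)該節(jié)點(diǎn)沒有污點(diǎn)會(huì)得到空文件,跳過
  if [ -n "$taints" ]; then
    # 將備份的taints數(shù)組直接寫回節(jié)點(diǎn)的spec.taints
    kubectl patch node "$node" --type merge -p "{\"spec\":{\"taints\":${taints}}}"
    echo "restored taints for $node"
  fi
done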
六、總結(jié)
6.1 技術(shù)要點(diǎn)回顧
要點(diǎn)一:nodeSelector適合簡單場(chǎng)景,nodeAffinity適合需要軟硬約束結(jié)合的場(chǎng)景,兩者都基于節(jié)點(diǎn)標(biāo)簽,標(biāo)簽規(guī)劃是基礎(chǔ)
要點(diǎn)二:podAntiAffinity硬約束會(huì)限制副本數(shù)不超過節(jié)點(diǎn)數(shù),大規(guī)模集群中計(jì)算開銷大,優(yōu)先用topologySpreadConstraints替代
要點(diǎn)三:taints/tolerations從節(jié)點(diǎn)角度控制調(diào)度,適合節(jié)點(diǎn)專用化場(chǎng)景(GPU節(jié)點(diǎn)、數(shù)據(jù)庫節(jié)點(diǎn)),NoExecute會(huì)驅(qū)逐已有Pod
要點(diǎn)四:PriorityClass的搶占機(jī)制要謹(jǐn)慎使用,批處理任務(wù)設(shè)置preemptionPolicy: Never,核心服務(wù)設(shè)置高優(yōu)先級(jí)
要點(diǎn)五:topologySpreadConstraints是生產(chǎn)環(huán)境跨故障域部署的首選方案,maxSkew: 1配合DoNotSchedule保證嚴(yán)格均勻分布
6.2 進(jìn)階學(xué)習(xí)方向
自定義調(diào)度器:當(dāng)內(nèi)置調(diào)度策略無法滿足需求時(shí),可以開發(fā)自定義調(diào)度器,通過Scheduling Framework擴(kuò)展點(diǎn)實(shí)現(xiàn)
學(xué)習(xí)資源:Scheduling Framework
實(shí)踐建議:先用調(diào)度器擴(kuò)展(Extender)驗(yàn)證邏輯,再考慮寫Framework插件
Descheduler深度使用:配置LowNodeUtilization、RemoveDuplicates等策略,自動(dòng)重平衡集群負(fù)載
學(xué)習(xí)資源:Descheduler GitHub
實(shí)踐建議:先在非生產(chǎn)環(huán)境測(cè)試Descheduler策略,避免誤驅(qū)逐核心服務(wù)
Volcano批調(diào)度器:針對(duì)AI/大數(shù)據(jù)場(chǎng)景的批調(diào)度器,支持Gang Scheduling(一組Pod要么全部調(diào)度成功,要么全部不調(diào)度)
6.3 參考資料
Kubernetes調(diào)度官方文檔 - 調(diào)度機(jī)制全面說明
kube-scheduler源碼 - 理解調(diào)度算法實(shí)現(xiàn)
Descheduler項(xiàng)目 - Pod重調(diào)度工具
Volcano項(xiàng)目 - 批調(diào)度器
附錄
A. 命令速查表
# 節(jié)點(diǎn)標(biāo)簽管理
kubectl label node <node-name> key=value              # 添加標(biāo)簽
kubectl label node <node-name> key=value --overwrite  # 修改標(biāo)簽
kubectl label node <node-name> key-                   # 刪除標(biāo)簽
kubectl get nodes --show-labels                        # 查看所有標(biāo)簽
kubectl get nodes -L key1,key2                         # 查看指定標(biāo)簽列

# 污點(diǎn)管理
kubectl taint nodes <node-name> key=value:NoSchedule   # 添加污點(diǎn)
kubectl taint nodes <node-name> key=value:NoSchedule-  # 刪除污點(diǎn)
kubectl taint nodes <node-name> key-                   # 刪除指定key的所有污點(diǎn)
kubectl describe node <node-name> | grep Taints        # 查看污點(diǎn)

# 調(diào)度排查
kubectl get events --field-selector reason=FailedScheduling -A        # 調(diào)度失敗事件
kubectl describe pod <pod-name> | grep -A 10 Events                   # Pod事件
kubectl get pods -o wide                                               # 查看Pod所在節(jié)點(diǎn)
kubectl top nodes                                                      # 節(jié)點(diǎn)資源使用
kubectl describe node <node-name> | grep -A 20 "Allocated resources"  # 已分配資源

# 節(jié)點(diǎn)維護(hù)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data  # 騰空節(jié)點(diǎn)
kubectl uncordon <node-name>                                           # 恢復(fù)調(diào)度
kubectl cordon <node-name>                                             # 標(biāo)記不可調(diào)度
B. 配置參數(shù)詳解
nodeAffinity操作符:
| 操作符 | 含義 | 示例 |
|---|---|---|
| In | 值在列表中 | values: ["ssd", "nvme"] |
| NotIn | 值不在列表中 | values: ["hdd"] |
| Exists | 標(biāo)簽存在 | 不需要values字段 |
| DoesNotExist | 標(biāo)簽不存在 | 不需要values字段 |
| Gt | 值大于 | values: ["100"] (數(shù)字字符串) |
| Lt | 值小于 | values: ["50"] (數(shù)字字符串) |
Taint Effect對(duì)比:
| Effect | 對(duì)新Pod | 對(duì)已有Pod | 適用場(chǎng)景 |
|---|---|---|---|
| NoSchedule | 不調(diào)度 | 不影響 | 節(jié)點(diǎn)專用化 |
| PreferNoSchedule | 盡量不調(diào)度 | 不影響 | 軟限制 |
| NoExecute | 不調(diào)度 | 驅(qū)逐 | 節(jié)點(diǎn)維護(hù)、故障隔離 |
topologySpreadConstraints參數(shù):
| 參數(shù) | 類型 | 說明 |
|---|---|---|
| maxSkew | int | 拓?fù)溆蜷gPod數(shù)量最大差值,1表示嚴(yán)格均勻 |
| topologyKey | string | 拓?fù)溆驑?biāo)簽key |
| whenUnsatisfiable | string | DoNotSchedule(硬約束)/ ScheduleAnyway(軟約束) |
| labelSelector | object | 匹配Pod的標(biāo)簽選擇器 |
| minDomains | int | 最少拓?fù)溆驍?shù)量(1.25 Beta) |
| matchLabelKeys | []string | 用于區(qū)分不同版本的Pod(1.27 Beta) |
C. 術(shù)語表
| 術(shù)語 | 英文 | 解釋 |
|---|---|---|
| 預(yù)選 | Filtering | 調(diào)度第一階段,過濾不滿足條件的節(jié)點(diǎn) |
| 優(yōu)選 | Scoring | 調(diào)度第二階段,對(duì)候選節(jié)點(diǎn)打分排序 |
| 親和性 | Affinity | Pod傾向于調(diào)度到滿足條件的節(jié)點(diǎn)或Pod附近 |
| 反親和性 | Anti-Affinity | Pod傾向于遠(yuǎn)離滿足條件的Pod |
| 污點(diǎn) | Taint | 節(jié)點(diǎn)上的排斥標(biāo)記,阻止不容忍的Pod調(diào)度 |
| 容忍 | Toleration | Pod上的聲明,表示可以容忍節(jié)點(diǎn)的污點(diǎn) |
| 拓?fù)溆?/td> | Topology Domain | 按標(biāo)簽劃分的節(jié)點(diǎn)組,如同一可用區(qū)的節(jié)點(diǎn) |
| 搶占 | Preemption | 高優(yōu)先級(jí)Pod驅(qū)逐低優(yōu)先級(jí)Pod以獲取資源 |
| PDB | PodDisruptionBudget | Pod中斷預(yù)算,限制同時(shí)不可用的Pod數(shù)量 |
原文標(biāo)題:別再讓 Pod “亂跑”:Kubernetes 調(diào)度策略原理與落地指南
文章出處:【微信號(hào):magedu-Linux,微信公眾號(hào):馬哥Linux運(yùn)維】歡迎添加關(guān)注!文章轉(zhuǎn)載請(qǐng)注明出處。