What does the CPU Manager do?
Users familiar with Docker have almost certainly used its cpuset capability, which pins a container to specific CPUs and memory nodes when it starts.
--cpuset-cpus=""   CPUs in which to allow execution (0-3, 0,1)
--cpuset-mems=""   Memory nodes (MEMs) in which to allow execution (0-3, 0,1). Only effective on NUMA systems.
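For example, the following (illustrative) command pins a container to CPUs 0 and 1 and to NUMA memory node 0; the nginx image is just a stand-in:

    # restrict the container to logical CPUs 0,1 and memory node 0
    docker run -d --cpuset-cpus="0,1" --cpuset-mems="0" nginx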
Kubernetes, however, long lacked an equivalent capability. Starting with Kubernetes 1.8, the CPU Manager feature was introduced to support cpuset. From Kubernetes 1.10 up to the current 1.12, the feature is still Beta.
The CPU Manager is a module inside the kubelet's Container Manager (CM). Its goal is to bind certain containers to dedicated CPUs, thereby improving the performance of CPU-sensitive workloads.
When should you consider using the CPU Manager?
As mentioned above, CPU-sensitive workloads can gain a significant performance boost from cpuset. So which characteristics make a workload CPU-sensitive?
Sensitive to CPU throttling effects.
Sensitive to context switches.
Sensitive to processor cache misses.
Benefits from sharing processor resources (e.g., data and instruction caches).
Sensitive to cross-socket memory traffic.
Sensitive to, or requires, hyperthreads from the same physical CPU core.
How to use the CPU Manager
In Kubernetes v1.8-1.9 the CPU Manager was Alpha; in v1.10-1.12 it is Beta. I have not followed the CPU Manager changelog across these releases in detail, but I recommend using it on 1.10 or later.
Enable CPU Manager
Make sure the CPUManager feature gate is enabled in the kubelet (BETA - default=true).
The CPU Manager currently supports two policies, none and static, selected via the kubelet flag --cpu-manager-policy; a dynamic policy may be added in the future to adjust a container's cpuset during its lifecycle.
none: the default value, equivalent to not enabling cpuset at all. The CPU request maps to CPU shares and the CPU limit maps to CFS quota.
static: enable it by setting --cpu-manager-policy=static. The kubelet assigns a dedicated CPU set before the container starts, and the assignment takes CPU topology into account to improve CPU affinity, as described later.
Make sure the kubelet has values configured for both --kube-reserved and --system-reserved. They do not need to be whole CPUs; the reserved CPU count is rounded up when it is computed. This prevents the CPU Manager from handing out every CPU core on the node, which would leave the kubelet and system processes with no CPUs to run on.
Note that the CPU Manager also has a --cpu-manager-reconcile-period option, which sets how often the CPU Manager reconciles the CPU assignments held in kubelet memory into the cpuset cgroups. If it is not set, the value of --node-status-update-frequency (default 10s) is used.
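Putting these flags together, a minimal sketch of a kubelet invocation with the static policy enabled might look like this (the reservation and reconcile-period values are illustrative; choose ones that fit your node):

    kubelet \
      --feature-gates=CPUManager=true \
      --cpu-manager-policy=static \
      --cpu-manager-reconcile-period=5s \
      --kube-reserved=cpu=500m,memory=1Gi \
      --system-reserved=cpu=500m,memory=1Gi \
      ...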
Workload requirements
With the configuration above in place, the static CPU Manager is enabled; the next step is to use it from a workload. Kubernetes requires a Pod/Container to satisfy both of the following conditions for the CPU Manager to give it dedicated CPUs:
The Pod's QoS class is Guaranteed;
The container's CPU request is an integer number of CPUs.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
Containers in any other case do not get dedicated CPUs from the CPU Manager; instead they run via CFS on the CPUs in the shared pool. The shared pool is the node's CPUCapacity - ReservedCPUs - ExclusiveCPUs. For example, on a 16-CPU node with 1 reserved CPU and 4 exclusively assigned CPUs, the shared pool contains the remaining 11 CPUs.
CPU Manager workflow
When the CPU Manager assigns dedicated CPUs to a qualifying container, it allocates along the CPU topology as far as possible, i.e., it considers CPU affinity and selects CPUs in the following order of preference (logical CPUs are hyperthreads):
If the container requests at least as many logical CPUs as a single CPU socket provides, whole sockets' worth of logical CPUs are assigned to the container first.
If the container's remaining request is at least as many logical CPUs as a single physical CPU core provides, whole physical cores' worth of logical CPUs are assigned next.
The container's remaining logical CPUs are then picked from a list of logical CPUs sorted by:
number of CPUs available on the same socket
number of CPUs available on the same core
pkg/kubelet/cm/cpumanager/cpu_assignment.go:149

func takeByTopology(topo *topology.CPUTopology, availableCPUs cpuset.CPUSet, numCPUs int) (cpuset.CPUSet, error) {
    acc := newCPUAccumulator(topo, availableCPUs, numCPUs)
    if acc.isSatisfied() {
        return acc.result, nil
    }
    if acc.isFailed() {
        return cpuset.NewCPUSet(), fmt.Errorf("not enough cpus available to satisfy request")
    }

    // Algorithm: topology-aware best-fit
    // 1. Acquire whole sockets, if available and the container requires at
    //    least a socket's-worth of CPUs.
    for _, s := range acc.freeSockets() {
        if acc.needs(acc.topo.CPUsPerSocket()) {
            glog.V(4).Infof("[cpumanager] takeByTopology: claiming socket [%d]", s)
            acc.take(acc.details.CPUsInSocket(s))
            if acc.isSatisfied() {
                return acc.result, nil
            }
        }
    }

    // 2. Acquire whole cores, if available and the container requires at least
    //    a core's-worth of CPUs.
    for _, c := range acc.freeCores() {
        if acc.needs(acc.topo.CPUsPerCore()) {
            glog.V(4).Infof("[cpumanager] takeByTopology: claiming core [%d]", c)
            acc.take(acc.details.CPUsInCore(c))
            if acc.isSatisfied() {
                return acc.result, nil
            }
        }
    }

    // 3. Acquire single threads, preferring to fill partially-allocated cores
    //    on the same sockets as the whole cores we have already taken in this
    //    allocation.
    for _, c := range acc.freeCPUs() {
        glog.V(4).Infof("[cpumanager] takeByTopology: claiming CPU [%d]", c)
        if acc.needs(1) {
            acc.take(cpuset.NewCPUSet(c))
        }
        if acc.isSatisfied() {
            return acc.result, nil
        }
    }

    return cpuset.NewCPUSet(), fmt.Errorf("failed to allocate cpus")
}
Discovering CPU topology
For the CPU Manager to work correctly, the node's CPU topology must first be discovered; that discovery is done by cAdvisor.
cAdvisor's MachineInfo records the CPU and memory topology in its Topology field, where each Node object corresponds to one CPU socket.
vendor/github.com/google/cadvisor/info/v1/machine.go

type MachineInfo struct {
    // The number of cores in this machine.
    NumCores int `json:"num_cores"`
    ...
    // Machine Topology
    // Describes cpu/memory layout and hierarchy.
    Topology []Node `json:"topology"`
    ...
}

type Node struct {
    Id int `json:"node_id"`
    // Per-node memory
    Memory uint64  `json:"memory"`
    Cores  []Core  `json:"cores"`
    Caches []Cache `json:"caches"`
}
cAdvisor builds this information in GetTopology, mainly by parsing /proc/cpuinfo for the CPU topology and by reading /sys/devices/system/cpu/cpu for CPU cache information.
vendor/github.com/google/cadvisor/machine/machine.go

func GetTopology(sysFs sysfs.SysFs, cpuinfo string) ([]info.Node, int, error) {
    nodes := []info.Node{}
    ...
    return nodes, numCores, nil
}
A typical NUMA CPU topology looks like the following:
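As a purely hypothetical illustration, a two-socket node with four physical cores per socket and hyperthreading enabled exposes 16 logical CPUs; the actual numbering varies by machine and can be inspected with lscpu --extended, which lists the CPU, core, socket, and NUMA node columns.

    Socket 0                          Socket 1
      Core 0: logical CPUs 0, 8         Core 4: logical CPUs 4, 12
      Core 1: logical CPUs 1, 9         Core 5: logical CPUs 5, 13
      Core 2: logical CPUs 2, 10        Core 6: logical CPUs 6, 14
      Core 3: logical CPUs 3, 11        Core 7: logical CPUs 7, 15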
Container creation
When a container that satisfies the static-policy conditions above is created, the kubelet picks the optimal CPU set for it according to the CPU-affinity rules described earlier. Roughly, the CPU Manager workflow at container creation is:
Kuberuntime asks the container runtime to create the container.
Kuberuntime hands the container to the CPU Manager.
The CPU Manager processes the container according to the static-policy logic.
The CPU Manager picks the topologically "best" CPU set from the current shared pool; for containers that do not satisfy the static policy, it returns the set of all CPUs in the shared pool.
The CPU Manager records the container's CPU assignment in the checkpoint state and removes the newly assigned CPUs from the shared pool.
The CPU Manager then reads the container's CPU assignment back from the state and applies it to the cpuset cgroups via the UpdateContainerResources CRI call; this happens for non-static-policy containers as well.
Kuberuntime asks the container runtime to start the container.
func (m *manager) AddContainer(p *v1.Pod, c *v1.Container, containerID string) error {
    m.Lock()
    err := m.policy.AddContainer(m.state, p, c, containerID)
    if err != nil {
        glog.Errorf("[cpumanager] AddContainer error: %v", err)
        m.Unlock()
        return err
    }
    cpus := m.state.GetCPUSetOrDefault(containerID)
    m.Unlock()

    if !cpus.IsEmpty() {
        err = m.updateContainerCPUSet(containerID, cpus)
        if err != nil {
            glog.Errorf("[cpumanager] AddContainer error: %v", err)
            return err
        }
    } else {
        glog.V(5).Infof("[cpumanager] update container resources is skipped due to cpu set is empty")
    }

    return nil
}
Container deletion
When a container that was assigned CPUs by the CPU Manager is deleted, the CPU Manager workflow is roughly:
Kuberuntime calls the CPU Manager, which handles the container according to the static-policy logic.
The CPU Manager returns the container's assigned CPU set to the shared pool.
Kuberuntime asks the container runtime to remove the container.
Asynchronously, the CPU Manager's reconcile loop updates the CPU sets of the containers running on the shared pool.
func (m *manager) RemoveContainer(containerID string) error {
    m.Lock()
    defer m.Unlock()

    err := m.policy.RemoveContainer(m.state, containerID)
    if err != nil {
        glog.Errorf("[cpumanager] RemoveContainer error: %v", err)
        return err
    }
    return nil
}
Checkpoint
What should you do if the checkpoint file is corrupted or deleted?
Note: CPU Manager doesn’t support offlining and onlining of CPUs at runtime. Also, if the set of online CPUs changes on the node, the node must be drained and CPU manager manually reset by deleting the state file cpu_manager_state in the kubelet root directory.
The CPU Manager is created as part of creating the Container Manager. Let's look at what happens when the CPU Manager is created; that also tells us what the CPU Manager does when the kubelet restarts.
// NewManager creates new cpu manager based on provided policy
func NewManager(cpuPolicyName string, reconcilePeriod time.Duration, machineInfo *cadvisorapi.MachineInfo, nodeAllocatableReservation v1.ResourceList, stateFileDirectory string) (Manager, error) {
    var policy Policy

    switch policyName(cpuPolicyName) {

    case PolicyNone:
        policy = NewNonePolicy()

    case PolicyStatic:
        topo, err := topology.Discover(machineInfo)
        if err != nil {
            return nil, err
        }
        glog.Infof("[cpumanager] detected CPU topology: %v", topo)
        reservedCPUs, ok := nodeAllocatableReservation[v1.ResourceCPU]
        if !ok {
            // The static policy cannot initialize without this information.
            return nil, fmt.Errorf("[cpumanager] unable to determine reserved CPU resources for static policy")
        }
        if reservedCPUs.IsZero() {
            // The static policy requires this to be nonzero. Zero CPU reservation
            // would allow the shared pool to be completely exhausted. At that point
            // either we would violate our guarantee of exclusivity or need to evict
            // any pod that has at least one container that requires zero CPUs.
            // See the comments in policy_static.go for more details.
            return nil, fmt.Errorf("[cpumanager] the static policy requires systemreserved.cpu + kubereserved.cpu to be greater than zero")
        }

        // Take the ceiling of the reservation, since fractional CPUs cannot be
        // exclusively allocated.
        reservedCPUsFloat := float64(reservedCPUs.MilliValue()) / 1000
        numReservedCPUs := int(math.Ceil(reservedCPUsFloat))
        policy = NewStaticPolicy(topo, numReservedCPUs)

    default:
        glog.Errorf("[cpumanager] Unknown policy \"%s\", falling back to default policy \"%s\"", cpuPolicyName, PolicyNone)
        policy = NewNonePolicy()
    }

    stateImpl, err := state.NewCheckpointState(stateFileDirectory, cpuManagerStateFileName, policy.Name())
    if err != nil {
        return nil, fmt.Errorf("could not initialize checkpoint manager: %v", err)
    }

    manager := &manager{
        policy:                     policy,
        reconcilePeriod:            reconcilePeriod,
        state:                      stateImpl,
        machineInfo:                machineInfo,
        nodeAllocatableReservation: nodeAllocatableReservation,
    }
    return manager, nil
}
topology.Discover is called to wrap cAdvisor's MachineInfo.Topology into the CPUTopology that the CPU Manager works with.
Next, reservedCPUs (KubeReservedCPUs + SystemReservedCPUs + HardEvictionThresholds) is computed and rounded up to obtain the final reserved CPU count. For example, kube-reserved cpu=500m plus system-reserved cpu=600m gives 1.1 CPUs, which rounds up to 2 reserved CPUs. If reservedCPUs is zero, an error is returned, because the static policy requires System Reserved and Kube Reserved to be non-empty.
NewStaticPolicy is called to create the static policy. During creation it calls takeByTopology to select the CPU set for the reserved CPUs, using the same selection logic the static policy applies to containers, and stores it in StaticPolicy.reserved. (Note that the reserved CPU set is not actually written to cgroups; it remains part of the default CPU set and is never handed out to static-policy containers, so the default CPU set is never empty: it always contains at least the reserved CPUs.) When allocateCPUs in AddContainer computes assignableCPUs, this reserved CPU set is excluded.
Next, state.NewCheckpointState is called to create the cpu_manager_state checkpoint file (it is not cleared if it already exists), initialize the in-memory state, and restore the checkpoint file's contents into memory.
The content of the cpu_manager_state checkpoint file is simply the JSON form of the CPUManagerCheckpoint struct, where the key of Entries is the container ID and the value is that container's assigned CPU set.
// CPUManagerCheckpoint struct is used to store cpu/pod assignments in a checkpoint
type CPUManagerCheckpoint struct {
    PolicyName    string            `json:"policyName"`
    DefaultCPUSet string            `json:"defaultCpuSet"`
    Entries       map[string]string `json:"entries,omitempty"`
    Checksum      checksum.Checksum `json:"checksum"`
}
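For illustration, the cpu_manager_state file on a hypothetical 8-CPU node with one container exclusively assigned 2 CPUs might look roughly like this; the container ID and checksum value are made up:

    {
      "policyName": "static",
      "defaultCpuSet": "0,3-7",
      "entries": {
        "<container-id>": "1-2"
      },
      "checksum": 1456028987
    }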
Next comes starting the CPU Manager.
func (m *manager) Start(activePods ActivePodsFunc, podStatusProvider status.PodStatusProvider, containerRuntime runtimeService) {
    glog.Infof("[cpumanager] starting with %s policy", m.policy.Name())
    glog.Infof("[cpumanager] reconciling every %v", m.reconcilePeriod)

    m.activePods = activePods
    m.podStatusProvider = podStatusProvider
    m.containerRuntime = containerRuntime

    m.policy.Start(m.state)
    if m.policy.Name() == string(PolicyNone) {
        return
    }
    go wait.Until(func() { m.reconcileState() }, m.reconcilePeriod, wait.NeverStop)
}
Start the policy (for the static policy this validates the state, as described below);
Start the reconcile loop (skipped for the none policy).
What exactly does the reconcile loop do?
The CPU Manager reconcile loop runs at the period configured by --cpu-manager-reconcile-period and mainly does the following:
Iterate over all containers of all activePods, including init containers, and process each container as described below.
Check whether the container ID is present in the memory-state assignments maintained by the CPU Manager.
If it is not present in the assignments:
- If the corresponding Pod.Status.Phase is Running and DeletionTimestamp is nil, call the CPU Manager's AddContainer, which checks the pod/container's QoS class and CPU request; if the static-policy conditions are met, takeByTopology picks the "best" CPU set for the container and writes it to the memory state and the checkpoint file (cpu_manager_state), and processing continues with the steps below.
- Otherwise (the pod is not Running, or DeletionTimestamp is set), skip the container; its processing ends here. Containers that do not satisfy the static policy are never added to the memory-state assignments, so for them the lookup below simply falls back to the default (shared pool) CPU set.
If the container ID is already present in the memory-state assignments, continue directly with the steps below.
Then read the container's CPU set from the memory state (GetCPUSetOrDefault).
Finally, call updateContainerCPUSet, which uses the CRI UpdateContainerResources call, to apply the CPU set to the cpuset cgroups.
pkg/kubelet/cm/cpumanager/cpu_manager.go:219

func (m *manager) reconcileState() (success []reconciledContainer, failure []reconciledContainer) {
    success = []reconciledContainer{}
    failure = []reconciledContainer{}

    for _, pod := range m.activePods() {
        allContainers := pod.Spec.InitContainers
        allContainers = append(allContainers, pod.Spec.Containers...)
        for _, container := range allContainers {
            status, ok := m.podStatusProvider.GetPodStatus(pod.UID)
            if !ok {
                glog.Warningf("[cpumanager] reconcileState: skipping pod; status not found (pod: %s, container: %s)", pod.Name, container.Name)
                failure = append(failure, reconciledContainer{pod.Name, container.Name, ""})
                break
            }

            containerID, err := findContainerIDByName(&status, container.Name)
            if err != nil {
                glog.Warningf("[cpumanager] reconcileState: skipping container; ID not found in status (pod: %s, container: %s, error: %v)", pod.Name, container.Name, err)
                failure = append(failure, reconciledContainer{pod.Name, container.Name, ""})
                continue
            }

            // Check whether container is present in state, there may be 3 reasons why it's not present:
            // - policy does not want to track the container
            // - kubelet has just been restarted - and there is no previous state file
            // - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
            if _, ok := m.state.GetCPUSet(containerID); !ok {
                if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
                    glog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
                    err := m.AddContainer(pod, &container, containerID)
                    if err != nil {
                        glog.Errorf("[cpumanager] reconcileState: failed to add container (pod: %s, container: %s, container id: %s, error: %v)", pod.Name, container.Name, containerID, err)
                        failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
                        continue
                    }
                } else {
                    // if DeletionTimestamp is set, pod has already been removed from state
                    // skip the pod/container since it's not running and will be deleted soon
                    continue
                }
            }

            cset := m.state.GetCPUSetOrDefault(containerID)
            if cset.IsEmpty() {
                // NOTE: This should not happen outside of tests.
                glog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)
                failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
                continue
            }

            glog.V(4).Infof("[cpumanager] reconcileState: updating container (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
            err = m.updateContainerCPUSet(containerID, cset)
            if err != nil {
                glog.Errorf("[cpumanager] reconcileState: failed to update container (pod: %s, container: %s, container id: %s, cpuset: \"%v\", error: %v)", pod.Name, container.Name, containerID, cset, err)
                failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
                continue
            }
            success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
        }
    }
    return success, failure
}
Validate State
When the CPU Manager starts, besides launching a goroutine for reconciliation, it also validates the state:
If the shared (default) CPU set in the memory state is empty, the CPU assignments must also be empty; the shared pool in the memory state is then initialized to all CPUs and written to the checkpoint file (initializing the checkpoint).
As long as we have not manually deleted the checkpoint file, state.NewCheckpointState (mentioned earlier) restores the memory state from it, so the previously assigned CPU sets and the default CPU set are still present.
If the memory state was already initialized (restored from the checkpoint), the policy checks that this startup's reserved CPU set is entirely contained in the default CPU set. If it is not (for example, the kube/system reserved CPUs were increased), an error is returned, because some CPUs in the reserved set have been assigned to containers; this may prevent those containers from starting, and the user has to fix the checkpoint file manually.
After the reserved-CPU check passes, the default CPU set is checked against the assigned CPU sets for overlap. Any intersection means the data restored from the checkpoint file into the memory state is inconsistent, and an error is returned.
Finally, all CPUs in the CPU topology obtained from cAdvisor at this startup are compared with all CPUs recorded in the memory state (default CPU set + assigned CPU sets, restored from the checkpoint). If they differ, an error is returned; the node's available CPUs may have changed between the last time the CPU Manager stopped and this startup.
pkg/kubelet/cm/cpumanager/policy_static.go:116

func (p *staticPolicy) validateState(s state.State) error {
    tmpAssignments := s.GetCPUAssignments()
    tmpDefaultCPUset := s.GetDefaultCPUSet()

    // Default cpuset cannot be empty when assignments exist
    if tmpDefaultCPUset.IsEmpty() {
        if len(tmpAssignments) != 0 {
            return fmt.Errorf("default cpuset cannot be empty")
        }
        // state is empty initialize
        allCPUs := p.topology.CPUDetails.CPUs()
        s.SetDefaultCPUSet(allCPUs)
        return nil
    }

    // State has already been initialized from file (is not empty)
    // 1. Check if the reserved cpuset is not part of default cpuset because:
    // - kube/system reserved have changed (increased) - may lead to some containers not being able to start
    // - user tampered with file
    if !p.reserved.Intersection(tmpDefaultCPUset).Equals(p.reserved) {
        return fmt.Errorf("not all reserved cpus: \"%s\" are present in defaultCpuSet: \"%s\"",
            p.reserved.String(), tmpDefaultCPUset.String())
    }

    // 2. Check if state for static policy is consistent
    for cID, cset := range tmpAssignments {
        // None of the cpu in DEFAULT cset should be in s.assignments
        if !tmpDefaultCPUset.Intersection(cset).IsEmpty() {
            return fmt.Errorf("container id: %s cpuset: \"%s\" overlaps with default cpuset \"%s\"",
                cID, cset.String(), tmpDefaultCPUset.String())
        }
    }

    // 3. It's possible that the set of available CPUs has changed since
    // the state was written. This can be due to for example
    // offlining a CPU when kubelet is not running. If this happens,
    // CPU manager will run into trouble when later it tries to
    // assign non-existent CPUs to containers. Validate that the
    // topology that was received during CPU manager startup matches with
    // the set of CPUs stored in the state.
    totalKnownCPUs := tmpDefaultCPUset.Clone()
    for _, cset := range tmpAssignments {
        totalKnownCPUs = totalKnownCPUs.Union(cset)
    }
    if !totalKnownCPUs.Equals(p.topology.CPUDetails.CPUs()) {
        return fmt.Errorf("current set of available CPUs \"%s\" doesn't match with CPUs in state \"%s\"",
            p.topology.CPUDetails.CPUs().String(), totalKnownCPUs.String())
    }

    return nil
}
Further thoughts
If a CPU in the shared pool is being used by non-Guaranteed pod containers and is later assigned by the CPU Manager to a static-policy container, what happens to the tasks already running on that CPU? Are they immediately moved to other CPUs in the shared pool?
When a static-policy container is added, besides picking the best CPU set for itself, the CPU Manager also removes that CPU set from the shared pool. So in the scenario above, the tasks already on that CPU keep running until the CPU scheduler next schedules them; once the updated cpuset cgroups take effect, those tasks simply no longer see the CPU they were running on.
Is a static-policy container exclusively pinned to its assigned CPUs from start to finish?
From the workflow analyzed above, after a static-policy container is assigned its CPUs, the assignments recorded in the memory state are pushed into the cpuset cgroups of the other containers by the reconcile loop, which runs every 10s by default. In the worst case, the static-policy container may therefore share its CPUs with non-static-policy containers for up to 10s.
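If you want to observe this on a node, you can check the cpuset actually applied to a container. The commands below are only a sketch: the container and pod IDs are placeholders, and the cgroup path depends on the cgroup driver and the pod's QoS class (the layout shown assumes the cgroupfs driver and a Guaranteed pod).

    # the cpuset Docker has applied to the container
    docker inspect -f '{{.HostConfig.CpusetCpus}}' <container-id>
    # the cpuset recorded in the cgroup hierarchy
    cat /sys/fs/cgroup/cpuset/kubepods/pod<pod-uid>/<container-id>/cpuset.cpus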
If the CPU Manager's checkpoint file is corrupted and no longer matches the actual CPU assignments, how can it be repaired?
From the analysis above we know the reconcile loop cannot repair this discrepancy on its own. It can be fixed in the following ways:
Method 1: regenerate the checkpoint file. Delete the checkpoint file and restart the kubelet; the CPU Manager's reconcile mechanism will walk all containers, reassign CPUs to those that satisfy the static-policy conditions, and update the cpuset cgroups (see the sketch after this list). Running containers may end up on a different CPU set, causing brief application jitter.
Method 2: drain the node to evict the pods so they are rescheduled onto nodes with healthy checkpoints, then clear or delete the checkpoint file. This also has some impact on the applications, since the pods have to be recreated on other nodes.
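A minimal sketch of Method 1, assuming a systemd-managed kubelet and the default kubelet root directory /var/lib/kubelet:

    # stop the kubelet, remove the CPU Manager state file, then start it again
    systemctl stop kubelet
    rm /var/lib/kubelet/cpu_manager_state
    systemctl start kubelet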
Limitations of the CPU Manager
Given cAdvisor's current CPU topology discovery capability, the CPU Manager does not take into account whether a CPU socket is close to a particular PCI bus when picking CPUs for a container.
The CPU Manager does not yet honor the isolcpus Linux kernel boot parameter. It would need to obtain the isolated CPUs configured via isolcpus (through cAdvisor, or by reading them directly) and exclude them when assigning CPUs to static-policy containers.
Dynamic allocation is not supported yet, i.e., changing a container's cpuset cgroups directly while the container is running.
Summary
Through this in-depth analysis of the kubelet CPU Manager, we now have a solid understanding of how it works, including its reconcile loop, the state-validation mechanism at startup, the checkpoint mechanism and how to repair it, and the CPU Manager's current limitations.