kube-scheduler is one of the core components of Kubernetes. It is responsible for scheduling across the entire cluster: based on specific scheduling algorithms and policies, it assigns Pods to the most suitable worker nodes so that cluster resources are used more fully and reasonably. This is one of the important reasons we chose Kubernetes in the first place; if a new technology doesn't help a company save money and improve efficiency, it is hard for it to gain traction.

Scheduling Process

By default, the scheduler provided by kube-scheduler meets most of our requirements, and the examples we have covered so far have largely relied on the default policy to ensure that Pods are assigned to nodes with sufficient resources. However, in a real online project we may know our applications better than Kubernetes does: for example, we may want a Pod to run only on a few specific nodes, or want certain nodes to run only certain types of applications. This requires the scheduler to be controllable.

The main purpose of kube-scheduler is to schedule Pods onto suitable nodes according to specific scheduling algorithms and policies. It is a standalone binary that, after startup, continuously watches the API Server, picks up Pods whose PodSpec.NodeName is empty, and creates a binding for each of them.

[image: kube-scheduler scheduling workflow]
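At its core this is a watch-and-bind loop. The following is a rough, heavily simplified sketch of that idea, not the real kube-scheduler code: it assumes a client-go release from the same era as this article (pre-context APIs), and pickNode is a hypothetical stub standing in for the whole filtering/scoring pipeline.

package main

import (
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// pickNode is a hypothetical stub standing in for the filtering/scoring pipeline.
func pickNode(pod *v1.Pod) string { return "node-1" }

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/scheduler.conf")
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// watch Pods whose spec.nodeName is still empty, i.e. not yet scheduled
	w, err := clientset.CoreV1().Pods(metav1.NamespaceAll).Watch(metav1.ListOptions{
		FieldSelector: "spec.nodeName=",
	})
	if err != nil {
		log.Fatal(err)
	}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*v1.Pod)
		if !ok {
			continue
		}
		// create a Binding to tell the API server which node the Pod goes to
		binding := &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
			Target:     v1.ObjectReference{Kind: "Node", Name: pickNode(pod)},
		}
		if err := clientset.CoreV1().Pods(pod.Namespace).Bind(binding); err != nil {
			log.Printf("bind failed: %v", err)
		}
	}
}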

This process may seem relatively simple to us, but in a real production environment, there are many issues to consider.

  • How to ensure the fairness of all node scheduling? It is important to know that not all nodes have the same resource allocation.
  • How to ensure that each node can be allocated resources?
  • How can the cluster resources be used efficiently?
  • How can the cluster resources be maximized?
  • How to ensure the performance and efficiency of Pod scheduling?
  • Can users customize their own scheduling policies according to their actual needs?

Considering the various complexities in real-world environments, the kubernetes scheduler is implemented in a plug-in form, which allows users to customize or develop the scheduler as a plug-in and integrate it with kubernetes.

The source code of the kubernetes scheduler is located in kubernetes/pkg/scheduler, and the general code directory structure is as follows: (the directory structure may not be the same for different versions)

kubernetes/pkg/scheduler
|-- scheduler.go         // concrete implementation of scheduling
|-- algorithm
|   |-- predicates       // node filtering (predicate) policies
|   |-- priorities       // node scoring (priority) policies
|-- algorithmprovider
|   |-- defaults         // definition of the default scheduler

The core program that creates and runs the Scheduler is in pkg/scheduler/scheduler.go and the entry program for kube-scheduler is in cmd/kube-scheduler/scheduler.go.

Customizing the Scheduler

Generally speaking, we have 4 ways to extend the Kubernetes scheduler.

  • One way is to clone the official kube-scheduler source code, modify it where appropriate, and then recompile and run the modified program. This method is the least recommended and rather impractical, as it requires a lot of extra effort to keep up with upstream scheduler changes.
  • The second way is to run our own scheduler alongside the default one and choose between them with the Pod's spec.schedulerName (the default scheduler is used when the field is not set). However, running multiple schedulers side by side is also troublesome: for example, when several schedulers place Pods on the same node at the same time, the node may not actually have enough resources for all of them. Maintaining a high-quality custom scheduler is also not easy, because it requires a comprehensive understanding of the default scheduler, the overall Kubernetes architecture, and the relationships and limitations of the various Kubernetes API objects.
  • The third approach is the scheduler extender, which is currently a viable solution compatible with the upstream scheduler. A scheduler extender is essentially a configurable webhook that exposes filter and prioritize endpoints, corresponding to the two main phases of the scheduling cycle (filtering and scoring).
  • The fourth approach is the Scheduling Framework. Kubernetes v1.15 introduced the pluggable architecture of the Scheduling Framework, which makes customizing the scheduler much easier. The framework adds a set of plugin APIs to the existing scheduler, keeping the scheduler "core" simple and maintainable while moving most scheduling functionality into plugins. In the v1.16 release we are using here, the scheduler extender approach above is already being superseded, so the Scheduling Framework is the way to customize the scheduler going forward.

Here we can briefly describe the implementation of the latter two approaches.

Scheduler Extensions

Before we get into the scheduler extensions, let’s take a look at how the Kubernetes scheduler works:

  1. The default scheduler starts with the specified parameters (we built the cluster with kubeadm, so its startup manifest is located at /etc/kubernetes/manifests/kube-scheduler.yaml)
  2. It watches the apiserver and puts Pods whose spec.nodeName is empty into the scheduler's internal scheduling queue
  3. It pops a Pod out of the scheduling queue and starts a standard scheduling cycle
  4. It retrieves the "hard requirements" from the Pod's properties (e.g. CPU/memory requests, nodeSelector/nodeAffinity); the filtering phase then computes a candidate list of nodes that satisfy these requirements
  5. It retrieves the "soft requirements" from the Pod's properties and applies some default "soft policies" (e.g. preferring to pack Pods together or spread them out across nodes); finally it gives each candidate node a score and selects the node with the highest score
  6. It communicates with the apiserver (sends a binding call) and sets the Pod's spec.nodeName property to indicate the node the Pod is dispatched to

As described in the official documentation, we can pass a configuration file to the scheduler with the --config parameter. This configuration file must contain a KubeSchedulerConfiguration object, in the following format: (/etc/kubernetes/scheduler-extender.yaml)

# file contents passed to kube-scheduler via "--config"
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.conf"
algorithmSource:
  policy:
    file:
      path: "/etc/kubernetes/scheduler-extender-policy.yaml"  # path to the custom scheduling policy file

The key field here is algorithmSource.policy. The policy file can be either a local file or a ConfigMap resource object, depending on how the scheduler is deployed; since our default scheduler here runs as a static Pod, we configure it as a local file.

The policy file /etc/kubernetes/scheduler-extender-policy.yaml should follow the format defined in kubernetes/pkg/scheduler/apis/config/legacy_types.go#L28. In the v1.16.2 release used here, both JSON and YAML policy files are supported. Below is a simple example of our definition; see the Extender documentation for the full specification of the policy file.

apiVersion: v1
kind: Policy
extenders:
- urlPrefix: "http://127.0.0.1:8888/"
  filterVerb: "filter"
  prioritizeVerb: "prioritize"
  weight: 1
  enableHttps: false

Our Policy file here extends the scheduler by defining extenders. Sometimes we don't even need to write any code: we can also customize the scheduler directly in this configuration file by specifying predicates and priorities; if they are not specified, the default DefaultProvider is used.

{
    "kind": "Policy",
    "apiVersion": "v1",
    "predicates": [{
        "name": "MatchNodeSelector"
    }, {
        "name": "PodFitsResources"
    }, {
        "name": "PodFitsHostPorts"
    },{
        "name": "HostName"
    }
    ],
    "priorities": [{
        "name": "EqualPriority",
        "weight": 2
    }, {
        "name": "ImageLocalityPriority",
        "weight": 4
    }, {
        "name": "LeastRequestedPriority",
        "weight": 2
    }, {
        "name": "BalancedResourceAllocation",
        "weight": 2
    }
    ],
    "extenders": [{
        "urlPrefix": "/prefix",
        "filterVerb": "filter",
        "prioritizeVerb": "prioritize",
        "weight": 1,
        "bindVerb": "bind",
        "enableHttps": false
    }]
}

This policy file defines an HTTP extender service running at 127.0.0.1:8888 and registers it with the default scheduler, so that at the end of the filtering and scoring phases the results are passed to the extender's <urlPrefix>/<filterVerb> and <urlPrefix>/<prioritizeVerb> endpoints, where we can filter and prioritize further to suit our specific business needs.

Example

Let's implement a simple scheduler extender directly in Go; of course, any other programming language would work. The entry point looks like this:

package main

import (
	"log"
	"net/http"
	"github.com/julienschmidt/httprouter"
)

func main() {
	router := httprouter.New()
	router.GET("/", Index)
	router.POST("/filter", Filter)
	router.POST("/prioritize", Prioritize)

	log.Fatal(http.ListenAndServe(":8888", router))
}

Then we need to implement the /filter and /prioritize endpoint handlers.
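Each handler simply decodes the JSON payload sent by the default scheduler and encodes the result back. Below is a minimal sketch of what the /filter handler could look like; the real handler in the sample repository may differ slightly. It assumes schedulerapi is the v1.16-era scheduler API package, encoding/json is used for serialization, and filter() is the business function shown next.

func Filter(w http.ResponseWriter, r *http.Request, _ httprouter.Params) {
	var args schedulerapi.ExtenderArgs
	var result *schedulerapi.ExtenderFilterResult

	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		// on a malformed request, return a result that only carries the error
		result = &schedulerapi.ExtenderFilterResult{Error: err.Error()}
	} else {
		// filter() is the business function shown below
		result = filter(args)
	}

	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(result); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}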

The filter function takes an argument of type schedulerapi.ExtenderArgs and returns a value of type *schedulerapi.ExtenderFilterResult. Inside the function we can further filter the input nodes:

// filter filters nodes according to the predicate rules defined by the extender
func filter(args schedulerapi.ExtenderArgs) *schedulerapi.ExtenderFilterResult {
	var filteredNodes []v1.Node
	failedNodes := make(schedulerapi.FailedNodesMap)
	pod := args.Pod

	for _, node := range args.Nodes.Items {
		fits, failReasons, _ := podFitsOnNode(pod, node)
		if fits {
			filteredNodes = append(filteredNodes, node)
		} else {
			failedNodes[node.Name] = strings.Join(failReasons, ",")
		}
	}

	result := schedulerapi.ExtenderFilterResult{
		Nodes: &v1.NodeList{
			Items: filteredNodes,
		},
		FailedNodes: failedNodes,
		Error:       "",
	}

	return &result
}

In the filter function we loop over each node and use our own business logic to decide whether the node should pass. Our implementation here is deliberately simple: in podFitsOnNode() we just check whether a random number is even; if it is, we treat it as a lucky node, otherwise we reject it.

// name and failure message of our single predicate (the values here are illustrative)
const (
    LuckyPred        = "Lucky"
    LuckyPredFailMsg = "pod is unlucky to fit on this node"
)

var predicatesSorted = []string{LuckyPred}

var predicatesFuncs = map[string]FitPredicate{
    LuckyPred: LuckyPredicate,
}

type FitPredicate func(pod *v1.Pod, node v1.Node) (bool, []string, error)

func podFitsOnNode(pod *v1.Pod, node v1.Node) (bool, []string, error) {
    fits := true
    var failReasons []string
    for _, predicateKey := range predicatesSorted {
        fit, failures, err := predicatesFuncs[predicateKey](pod, node)
        if err != nil {
            return false, nil, err
        }
        fits = fits && fit
        failReasons = append(failReasons, failures...)
    }
    return fits, failReasons, nil
}

func LuckyPredicate(pod *v1.Pod, node v1.Node) (bool, []string, error) {
    lucky := rand.Intn(2) == 0
    if lucky {
        log.Printf("pod %v/%v is lucky to fit on node %v\n", pod.Name, pod.Namespace, node.Name)
        return true, nil, nil
    }
    log.Printf("pod %v/%v is unlucky to fit on node %v\n", pod.Name, pod.Namespace, node.Name)
    return false, []string{LuckyPredFailMsg}, nil
}

The scoring function is implemented in a similar way; here we simply give each node a random score:

// webhooked from pkg/scheduler/core/generic_scheduler.go#PrioritizeNodes()
// the scores returned here are added to the scores produced by the default scheduler
func prioritize(args schedulerapi.ExtenderArgs) *schedulerapi.HostPriorityList {
	pod := args.Pod
	nodes := args.Nodes.Items

	hostPriorityList := make(schedulerapi.HostPriorityList, len(nodes))
	for i, node := range nodes {
		score := rand.Intn(schedulerapi.MaxPriority + 1)  // pick a random score within the maximum priority
		log.Printf(luckyPrioMsg, pod.Name, pod.Namespace, score)
		hostPriorityList[i] = schedulerapi.HostPriority{
			Host:  node.Name,
			Score: score,
		}
	}

	return &hostPriorityList
}

We can then use the following command to compile and package our application.

$ GOOS=linux GOARCH=amd64 go build -o app

The complete code for this section of the scheduler extension is available at: https://github.com/cnych/sample-scheduler-extender.

Once the build is complete, copy the binary app to the node where kube-scheduler runs and start it directly. Now we can wire the policy file above into the kube-scheduler component. Our cluster here was built with kubeadm, so we can directly modify /etc/kubernetes/manifests/kube-scheduler.yaml with the following content.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --config=/etc/kubernetes/scheduler-extender.yaml
    - --v=9
    image: gcr.azk8s.cn/google_containers/kube-scheduler:v1.16.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10251
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-extender.yaml
      name: extender
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-extender-policy.yaml
      name: extender-policy
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/scheduler-extender.yaml
      type: FileOrCreate
    name: extender
  - hostPath:
      path: /etc/kubernetes/scheduler-extender-policy.yaml
      type: FileOrCreate
    name: extender-policy
status: {}

Here we configure this directly on the default kube-scheduler, but we could also copy the scheduler's YAML, change its schedulerName, and deploy it separately without affecting the default scheduler, then set spec.schedulerName on the Pods that should use the test scheduler. For running multiple schedulers, see the official documentation on Configuring Multiple Schedulers.

After reconfiguring kube-scheduler, check the logs to verify that the restart was successful. Be sure to mount the /etc/kubernetes/scheduler-extender.yaml and /etc/kubernetes/scheduler-extender-policy.yaml files into the Pod, as shown in the manifest above.

$ kubectl logs -f kube-scheduler-ydzs-master -n kube-system
I0102 15:17:38.824657       1 serving.go:319] Generated self-signed cert in-memory
I0102 15:17:39.472276       1 server.go:143] Version: v1.16.2
I0102 15:17:39.472674       1 defaults.go:91] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W0102 15:17:39.479704       1 authorization.go:47] Authorization is disabled
W0102 15:17:39.479733       1 authentication.go:79] Authentication is disabled
I0102 15:17:39.479777       1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0102 15:17:39.480559       1 secure_serving.go:123] Serving securely on 127.0.0.1:10259
I0102 15:17:39.682180       1 leaderelection.go:241] attempting to acquire leader lease  kube-system/kube-scheduler...
I0102 15:17:56.500505       1 leaderelection.go:251] successfully acquired lease kube-system/kube-scheduler

Now that we have created and configured a very simple scheduler extender, let's run a Deployment to see how it works. We prepare a Deployment manifest with 20 replicas: (test-scheduler.yaml)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
spec:
  replicas: 20
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      containers:
      - name: pause
        image: gcr.azk8s.cn/google_containers/pause:3.1

Create the above resource object directly.

$ kubectl apply -f test-scheduler.yaml
deployment.apps/pause created

At this point we can look at the logs of the scheduler extender we wrote.

$ ./app
......
2020/01/03 12:27:29 pod pause-58584fbc95-bwn7t/default is unlucky to fit on node ydzs-node1
2020/01/03 12:27:29 pod pause-58584fbc95-bwn7t/default is lucky to get score 7
2020/01/03 12:27:29 pod pause-58584fbc95-bwn7t/default is lucky to get score 9
2020/01/03 12:27:29 pod pause-58584fbc95-86w92/default is unlucky to fit on node ydzs-node3
2020/01/03 12:27:29 pod pause-58584fbc95-86w92/default is unlucky to fit on node ydzs-node4
2020/01/03 12:27:29 pod pause-58584fbc95-86w92/default is lucky to fit on node ydzs-node1
2020/01/03 12:27:29 pod pause-58584fbc95-86w92/default is lucky to fit on node ydzs-node2
2020/01/03 12:27:29 pod pause-58584fbc95-86w92/default is lucky to get score 4
2020/01/03 12:27:29 pod pause-58584fbc95-86w92/default is lucky to get score 8
......

We can see the scheduling process for each Pod. Note that the default scheduler periodically retries failed Pods, so they are passed to our scheduler extender again and again; since our logic just checks whether a random number is even, eventually all Pods end up in the Running state.

The scheduler extender may meet our needs in some cases, but it still has some limitations and drawbacks.

  • Communication cost: data is transferred between the default scheduler and the scheduler extender with http(s), which has some cost when performing serialization and deserialization
  • Limited extension points: extensions can only participate at the end of certain phases, such as "Filter" and "Prioritize", and they cannot be called at the beginning or middle of any phase
  • Subtraction over addition: compared to the node candidate list passed by the default scheduler, we may have some requirements to add a new candidate node list, but this is a riskier operation because there is no guarantee that the new node will pass other requirements, so it is better for the scheduler extensions to perform "Subtraction" (further filtering) rather than "Addition" (adding nodes)
  • Cache sharing: the above is just a simple test example, but in a real project we need to make scheduling decisions based on the state of the whole cluster. The default scheduler makes its decisions well, but it cannot share its cache, which means we have to build and maintain our own.

Due to these limitations, the Kubernetes scheduling team came up with the fourth approach above, the Scheduling Framework, which addresses essentially all of these problems and is now the officially recommended extension mechanism, so it will be the mainstream way to extend the scheduler in the future.

Scheduling Framework

The scheduling framework defines a set of extension points. Users implement the interfaces defined by these extension points to provide their own scheduling logic (which we call plugins, or extensions) and register them at the extension points; when the scheduling framework reaches the corresponding extension point while executing the scheduling workflow, it calls the extensions the user registered there. Each extension point serves a specific purpose: extensions at some extension points can change the scheduler's decisions, while extensions at others are purely informational.

We know that whenever a Pod is scheduled, it is executed according to two processes: the scheduling process and the binding process.

The scheduling process selects a suitable node for the Pod, and the binding process applies that decision to the cluster (i.e. runs the Pod on the selected node); together, the scheduling and binding processes are called the scheduling context. Note that the scheduling process runs synchronously (only one Pod is scheduled at a time), while the binding process can run asynchronously (multiple Pods can be binding at the same time).

The scheduling and binding processes exit midway when they encounter either of the following conditions.

  • The scheduler thinks there is no optional node for the Pod
  • Internal error

In this case, the Pod is put back into the pending queue and waits for the next retry.

Extension Points

The following diagram shows the scheduling context in the scheduling framework and the extension points within it. A plugin can register at multiple extension points so that more complex, stateful tasks can be performed.

[image: scheduling context and extension points of the scheduling framework]

  1. The QueueSort extension is used to sort the queue of Pods waiting to be scheduled, to decide which Pod is scheduled first. Only one QueueSort plugin can be in effect at a time.

  2. The Pre-filter extension is used to pre-process Pod information or to check certain preconditions that the cluster or the Pod must meet; if a pre-filter returns an error, the scheduling cycle is aborted.

  3. The Filter extension is used to exclude nodes that cannot run the Pod. For each node, the scheduler will execute the filter extensions in order; if any of the filters marks the node as unselectable, the remaining filter extensions will not be executed. The scheduler can execute filter extensions on multiple nodes at the same time.

  4. Post-filter is a notification-type extension point. It is called at the end of the filter phase with the list of nodes that passed filtering; an extension can use this to update internal state or to generate logs or metrics.

  5. The Scoring extension is used to score all the selectable nodes. The scheduler calls the Scoring extension for each node, and the score is an integer within a fixed range. During the normalize-scoring phase, the scheduler combines each scoring extension's result for a given node with that extension's weight to produce the final score.

  6. The Normalize scoring extension modifies each node's score before the scheduler computes the final ranking of the nodes. The extension registered at this point is called with the scoring results produced by the same plugin's scoring extension; the scheduling framework calls the normalize-scoring extension of every plugin once per scheduling cycle.

  7. Reserve is a notification extension point that stateful plugins can use to record the resources reserved for a Pod on a node. This happens before the scheduler binds the Pod to the node, and it exists to avoid a situation where, while waiting for the binding to complete, the scheduler places a new Pod on the node and the resources actually used exceed what is available (binding a Pod to a node happens asynchronously). This is the last step of the scheduling cycle: after the Pod enters the reserved state, either the Unreserve extension is triggered when binding fails, or the Post-bind extension ends the binding cycle when binding succeeds.

  8. The Permit extension is used to prevent or delay the binding of a Pod to a node. A Permit extension can do one of three things (a hedged sketch of a Permit plugin follows the interface definitions below):

  • approve: When all permit extensions have approved the binding of the Pod to the node, the scheduler will continue the binding process
  • deny: If any of the permit extensions deny the binding of a Pod to a node, the Pod will be put back in the queue to be scheduled and the Unreserve extension will be triggered.
  • wait: If a permit extension returns wait, the Pod will remain in the permit phase until it is approved by another extension. If a timeout event occurs and the wait state becomes deny, the Pod will be put back in the queue to be dispatched and the Unreserve extension will be triggered.
  9. The Pre-bind extension is used to perform certain logic before the Pod is bound. For example, the pre-bind extension can mount a network-based data volume to a node so that the Pod can use it. If any of the pre-bind extensions return an error, the Pod will be put back in the queue to be dispatched, at which point the Unreserve extension will be triggered.

  10. The Bind extension is used to bind a Pod to a node.

  • The bind extension is executed only when all pre-bind extensions have been successfully executed
  • The scheduling framework invokes bind extensions one by one in the order in which they are registered
  • A specific bind extension can choose to process or not process the Pod
  • If a bind extension handles the binding of the Pod to a node, the remaining bind extensions will be ignored
  11. Post-bind is a notification extension.
  • The Post-bind extension is called after a Pod has been successfully bound to a node
  • The Post-bind extension is the last step of the binding process and can be used to perform resource cleanup actions
  12. Unreserve is a notification extension that is called if resources were reserved for a Pod and the Pod was then rejected during the binding cycle. The unreserve extension should release the compute resources that were reserved for the Pod on the node. The reserve and unreserve extensions of a plugin should appear in pairs.

If we want to implement our own plugin, we must register it with the scheduling framework and complete the configuration, in addition to implementing the extension point interfaces. The corresponding interfaces can be found in the source file pkg/scheduler/framework/v1alpha1/interface.go, as follows:

// PreFilterPlugin is an interface that must be implemented by "prefilter" plugins.
// These plugins are called at the beginning of the scheduling cycle.
type PreFilterPlugin interface {
	Plugin
	PreFilter(pc *PluginContext, p *v1.Pod) *Status
}

// FilterPlugin is an interface for Filter plugins. These plugins are called at the
// filter extension point for filtering out hosts that cannot run a pod.
// This concept used to be called 'predicate' in the original scheduler.
// These plugins should return "Success", "Unschedulable" or "Error" in Status.code.
// However, the scheduler accepts other valid codes as well.
// Anything other than "Success" will lead to exclusion of the given host from
// running the pod.
type FilterPlugin interface {
	Plugin
	Filter(pc *PluginContext, pod *v1.Pod, nodeName string) *Status
}

// PostFilterPlugin is an interface for Post-filter plugin. Post-filter is an
// informational extension point. Plugins will be called with a list of nodes
// that passed the filtering phase. A plugin may use this data to update internal
// state or to generate logs/metrics.
type PostFilterPlugin interface {
	Plugin
	PostFilter(pc *PluginContext, pod *v1.Pod, nodes []*v1.Node, filteredNodesStatuses NodeToStatusMap) *Status
}

// ScorePlugin is an interface that must be implemented by "score" plugins to rank
// nodes that passed the filtering phase.
type ScorePlugin interface {
	Plugin
	Score(pc *PluginContext, p *v1.Pod, nodeName string) (int, *Status)
}

// ScoreWithNormalizePlugin is an interface that must be implemented by "score"
// plugins that also need to normalize the node scoring results produced by the same
// plugin's "Score" method.
type ScoreWithNormalizePlugin interface {
	ScorePlugin
	NormalizeScore(pc *PluginContext, p *v1.Pod, scores NodeScoreList) *Status
}

// ReservePlugin is an interface for Reserve plugins. These plugins are called
// at the reservation point. These are meant to update the state of the plugin.
// This concept used to be called 'assume' in the original scheduler.
// These plugins should return only Success or Error in Status.code. However,
// the scheduler accepts other valid codes as well. Anything other than Success
// will lead to rejection of the pod.
type ReservePlugin interface {
	Plugin
	Reserve(pc *PluginContext, p *v1.Pod, nodeName string) *Status
}

// PreBindPlugin is an interface that must be implemented by "prebind" plugins.
// These plugins are called before a pod being scheduled.
type PreBindPlugin interface {
	Plugin
	PreBind(pc *PluginContext, p *v1.Pod, nodeName string) *Status
}

// PostBindPlugin is an interface that must be implemented by "postbind" plugins.
// These plugins are called after a pod is successfully bound to a node.
type PostBindPlugin interface {
	Plugin
	PostBind(pc *PluginContext, p *v1.Pod, nodeName string)
}

// UnreservePlugin is an interface for Unreserve plugins. This is an informational
// extension point. If a pod was reserved and then rejected in a later phase, then
// un-reserve plugins will be notified. Un-reserve plugins should clean up state
// associated with the reserved Pod.
type UnreservePlugin interface {
	Plugin
	Unreserve(pc *PluginContext, p *v1.Pod, nodeName string)
}

// PermitPlugin is an interface that must be implemented by "permit" plugins.
// These plugins are called before a pod is bound to a node.
type PermitPlugin interface {
	Plugin
	Permit(pc *PluginContext, p *v1.Pod, nodeName string) (*Status, time.Duration)
}

// BindPlugin is an interface that must be implemented by "bind" plugins. Bind
// plugins are used to bind a pod to a Node.
type BindPlugin interface {
	Plugin
	Bind(pc *PluginContext, p *v1.Pod, nodeName string) *Status
}
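
To make the Permit semantics described earlier more concrete, here is a hedged sketch of a Permit plugin written against the v1alpha1 interfaces above. The plugin name, the example.com/hold annotation and the 10-second timeout are all made up for illustration; they are not part of the original example.

// GatePermit is a hypothetical Permit plugin: it delays Pods that carry a
// (made-up) "hold" annotation and approves everything else immediately.
type GatePermit struct{}

func (g GatePermit) Name() string { return "gate-permit" }

func (g GatePermit) Permit(pc *framework.PluginContext, p *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	if p.Annotations["example.com/hold"] == "true" {
		// "wait": keep the Pod in the permit phase for up to 10s; it must be
		// approved by another plugin before the timeout, otherwise it is denied
		// and the Unreserve extensions are triggered.
		return framework.NewStatus(framework.Wait, "waiting for external approval"), 10 * time.Second
	}
	// "approve": let the binding cycle continue.
	return framework.NewStatus(framework.Success, ""), 0
}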

Plugins of the scheduling framework are enabled or disabled through the KubeSchedulerConfiguration resource object we saw above. The configuration in the following example enables a plugin that implements the reserve and preBind extension points, disables another plugin, and provides some configuration for the plugin foo.

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration

...

plugins:
  reserve:
    enabled:
    - name: foo
    - name: bar
    disabled:
    - name: baz
  preBind:
    enabled:
    - name: foo
    disabled:
    - name: baz

pluginConfig:
- name: foo
  args: >
        arbitrary content that the foo plugin can parse

The order in which extensions are called is as follows.

  • If no corresponding extension is configured for an extension point, the scheduling framework will use the extension from the default plugin
  • If an extension is configured and activated for an extension point, the scheduling framework will call the extension from the default plugin first, and then the configured extension
  • The extensions of the default plugins are always called first, and then the extensions at each extension point are called one by one in the order in which they are enabled in KubeSchedulerConfiguration
  • You can disable a default plugin's extension and then re-enable it at some position in the enabled list, which changes the order in which the default plugin's extensions are called

Suppose the default plugin foo implements the reserve extension point and we want to add a plugin bar to be called before foo; then we should first disable foo and then enable bar and foo in that order in the enabled list. An example configuration is shown below.

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration

...

plugins:
  reserve:
    enabled:
    - name: bar
    - name: foo
    disabled:
    - name: foo

In the source code directory pkg/scheduler/framework/plugins/examples there are several demonstration plugins that we can refer to for their implementation.

Example

In fact, implementing a scheduling framework plugin is not difficult: we only need to implement the relevant extension points and then register the plugin with the scheduler. The following is the plugin registry the default scheduler uses at initialization time:

func NewRegistry() Registry {
	return Registry{
		// FactoryMap:
		// New plugins are registered here.
		// example:
		// {
		//  stateful_plugin.Name: stateful.NewStatefulMultipointExample,
		//  fooplugin.Name: fooplugin.New,
		// }
	}
}

As you can see, no plugins are registered by default, so to make the scheduler recognize our plugin code we need to build our own scheduler binary. Fortunately we don't have to implement the scheduler from scratch: the kube-scheduler source file kubernetes/cmd/kube-scheduler/app/server.go exposes a NewSchedulerCommand entry function that takes a list of arguments of type Option, and Option happens to be exactly how a plugin is registered:

// Option configures a framework.Registry.
type Option func(framework.Registry) error

// NewSchedulerCommand creates a *cobra.Command object with default parameters and registryOptions
func NewSchedulerCommand(registryOptions ...Option) *cobra.Command {
  ......
}

So we can call this function directly as our entry point and pass our own plugin in as an argument; the same file provides a function named WithPlugin to create such an Option instance:

// WithPlugin creates an Option based on plugin name and factory.
func WithPlugin(name string, factory framework.PluginFactory) Option {
	return func(registry framework.Registry) error {
		return registry.Register(name, factory)
	}
}

So we end up with the following entry function.

func main() {
	rand.Seed(time.Now().UTC().UnixNano())

	command := app.NewSchedulerCommand(
		app.WithPlugin(sample.Name, sample.New), 
	)

	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		_, _ = fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}

}

WithPlugin(sample.Name, sample.New) refers to the plugin we are about to implement. From the signature of WithPlugin we can see that sample.New must be a value of type framework.PluginFactory, and PluginFactory is defined as a function:

type PluginFactory = func(configuration *runtime.Unknown, f FrameworkHandle) (Plugin, error)

So sample.New is a function of this type. Inside it we can read the plugin's configuration and keep a handle to the framework for later use. The plugin implementation is shown below: here we simply log the data we receive, but with real requirements you would process it further. We only implement the PreFilter, Filter and PreBind extension points; the others can be added in the same way (a Score sketch follows after the code).

// Name is the plugin name used in the scheduler configuration
const Name = "sample-plugin"

type Args struct {
	FavoriteColor  string `json:"favorite_color,omitempty"`
	FavoriteNumber int    `json:"favorite_number,omitempty"`
	ThanksTo       string `json:"thanks_to,omitempty"`
}

type Sample struct {
	args   *Args
	handle framework.FrameworkHandle
}

func (s *Sample) Name() string {
	return Name
}

func (s *Sample) PreFilter(pc *framework.PluginContext, pod *v1.Pod) *framework.Status {
	klog.V(3).Infof("prefilter pod: %v", pod.Name)
	return framework.NewStatus(framework.Success, "")
}

func (s *Sample) Filter(pc *framework.PluginContext, pod *v1.Pod, nodeName string) *framework.Status {
	klog.V(3).Infof("filter pod: %v, node: %v", pod.Name, nodeName)
	return framework.NewStatus(framework.Success, "")
}

func (s *Sample) PreBind(pc *framework.PluginContext, pod *v1.Pod, nodeName string) *framework.Status {
	if nodeInfo, ok := s.handle.NodeInfoSnapshot().NodeInfoMap[nodeName]; !ok {
		return framework.NewStatus(framework.Error, fmt.Sprintf("prebind get node info error: %+v", nodeName))
	} else {
		klog.V(3).Infof("prebind node info: %+v", nodeInfo.Node())
		return framework.NewStatus(framework.Success, "")
	}
}

//type PluginFactory = func(configuration *runtime.Unknown, f FrameworkHandle) (Plugin, error)
func New(configuration *runtime.Unknown, f framework.FrameworkHandle) (framework.Plugin, error) {
	args := &Args{}
	if err := framework.DecodeInto(configuration, args); err != nil {
		return nil, err
	}
	klog.V(3).Infof("get plugin config args: %+v", args)
	return &Sample{
		args: args,
		handle: f,
	}, nil
}
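
As an illustration of "the same way", here is a hedged sketch of how this Sample plugin could additionally implement the Score extension point. This is not part of the original example, the fixed score is arbitrary, and the plugin would also need to be enabled under plugins.score in the KubeSchedulerConfiguration shown below.

// Score gives every node the same arbitrary score; a real plugin would rank nodes here.
func (s *Sample) Score(pc *framework.PluginContext, pod *v1.Pod, nodeName string) (int, *framework.Status) {
	klog.V(3).Infof("score pod: %v, node: %v", pod.Name, nodeName)
	return 10, framework.NewStatus(framework.Success, "")
}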

The full code can be obtained from the repository https://github.com/cnych/sample-scheduler-framework.

After the implementation is complete, we can compile it, package it into an image, and deploy it as a normal application with a Deployment controller. In the KubeSchedulerConfiguration resource object below, we enable or disable our plugin via plugins, and pass parameter values to the plugin via pluginConfig.

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sample-scheduler-clusterrole
rules:
  - apiGroups:
      - ""
    resources:
      - endpoints
      - events
    verbs:
      - create
      - get
      - update
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - delete
      - get
      - list
      - watch
      - update
  - apiGroups:
      - ""
    resources:
      - bindings
      - pods/binding
    verbs:
      - create
  - apiGroups:
      - ""
    resources:
      - pods/status
    verbs:
      - patch
      - update
  - apiGroups:
      - ""
    resources:
      - replicationcontrollers
      - services
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
      - extensions
    resources:
      - replicasets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - statefulsets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - persistentvolumeclaims
      - persistentvolumes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "storage.k8s.io"
    resources:
      - storageclasses
      - csinodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "coordination.k8s.io"
    resources:
      - leases
    verbs:
      - create
      - get
      - list
      - update
  - apiGroups:
      - "events.k8s.io"
    resources:
      - events
    verbs:
      - create
      - patch
      - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sample-scheduler-sa
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sample-scheduler-clusterrolebinding
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sample-scheduler-clusterrole
subjects:
- kind: ServiceAccount
  name: sample-scheduler-sa
  namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration
    schedulerName: sample-scheduler
    leaderElection:
      leaderElect: true
      lockObjectName: sample-scheduler
      lockObjectNamespace: kube-system
    plugins:
      preFilter:
        enabled:
        - name: "sample-plugin"
      filter:
        enabled:
        - name: "sample-plugin"
      preBind:
        enabled:
        - name: "sample-plugin"
    pluginConfig:
    - name: "sample-plugin"
      args:
        favorite_color: "#326CE5"
        favorite_number: 7
        thanks_to: "thockin"    
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-scheduler
  namespace: kube-system
  labels:
    component: sample-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      component: sample-scheduler
  template:
    metadata:
      labels:
        component: sample-scheduler
    spec:
      serviceAccount: sample-scheduler-sa
      priorityClassName: system-cluster-critical
      volumes:
        - name: scheduler-config
          configMap:
            name: scheduler-config
      containers:
        - name: scheduler-ctrl
          image: cnych/sample-scheduler:v0.1.6
          imagePullPolicy: IfNotPresent
          args:
            - sample-scheduler-framework
            - --config=/etc/kubernetes/scheduler-config.yaml
            - --v=3
          resources:
            requests:
              cpu: "50m"
          volumeMounts:
            - name: scheduler-config
              mountPath: /etc/kubernetes

After deploying the above resource objects, we have a scheduler named sample-scheduler running; now we can deploy an application that uses this scheduler.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-scheduler
  template:
    metadata:
      labels:
        app: test-scheduler
    spec:
      schedulerName: sample-scheduler
      containers:
      - image: nginx
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80


Let’s create this resource object directly and check the logging information of our custom scheduler after it is created.

$ kubectl get pods -n kube-system -l component=sample-scheduler
NAME                               READY   STATUS    RESTARTS   AGE
sample-scheduler-7c469787f-rwhhd   1/1     Running   0          13m
$ kubectl logs -f sample-scheduler-7c469787f-rwhhd -n kube-system
I0104 08:24:22.087881       1 scheduler.go:530] Attempting to schedule pod: default/test-scheduler-6d779d9465-rq2bb
I0104 08:24:22.087992       1 plugins.go:23] prefilter pod: test-scheduler-6d779d9465-rq2bb
I0104 08:24:22.088657       1 plugins.go:28] filter pod: test-scheduler-6d779d9465-rq2bb, node: ydzs-node1
I0104 08:24:22.088797       1 plugins.go:28] filter pod: test-scheduler-6d779d9465-rq2bb, node: ydzs-node2
I0104 08:24:22.088871       1 plugins.go:28] filter pod: test-scheduler-6d779d9465-rq2bb, node: ydzs-node3
I0104 08:24:22.088946       1 plugins.go:28] filter pod: test-scheduler-6d779d9465-rq2bb, node: ydzs-node4
I0104 08:24:22.088992       1 plugins.go:28] filter pod: test-scheduler-6d779d9465-rq2bb, node: ydzs-master
I0104 08:24:22.090653       1 plugins.go:36] prebind node info: &Node{ObjectMeta:{ydzs-node3   /api/v1/nodes/ydzs-node3 1ff6e228-4d98-4737-b6d3-30a5d55ccdc2 15466372 0 2019-11-10 09:05:09 +0000 UTC <nil> <nil> ......}
I0104 08:24:22.091761       1 factory.go:610] Attempting to bind test-scheduler-6d779d9465-rq2bb to ydzs-node3
I0104 08:24:22.104994       1 scheduler.go:667] pod default/test-scheduler-6d779d9465-rq2bb is bound successfully on node "ydzs-node3", 5 nodes evaluated, 4 nodes were found feasible. Bound node resource: "Capacity: CPU<4>|Memory<8008820Ki>|Pods<110>|StorageEphemeral<17921Mi>; Allocatable: CPU<4>|Memory<7906420Ki>|Pods<110>|StorageEphemeral<16912377419>.".

You can see that after we create the Pod, the logs from the extension points we defined show up in our custom scheduler, which proves that the example works. We can also verify this by looking at the Pod's schedulerName:

$ kubectl get pods
NAME                                      READY   STATUS    RESTARTS   AGE
test-scheduler-6d779d9465-rq2bb           1/1     Running   0          22m
$ kubectl get pod test-scheduler-6d779d9465-rq2bb -o yaml
......
restartPolicy: Always
schedulerName: sample-scheduler
securityContext: {}
serviceAccount: default
......

In the latest Kubernetes release, v1.17, the built-in predicates and priorities have all been converted into Scheduling Framework plugins, so the scheduling framework is the approach we should master and understand in order to extend the scheduler going forward.