This article provides a quick introduction to the principles and implementation of CFS, the completely fair scheduling class in the Linux kernel.

Principle

The core principle of CFS (the Completely Fair Scheduler) is simple: give every process as “fair” a share of runtime as possible, which means the process that has run the least so far is picked to run each time. Of course, a real scheduler has to satisfy much more than that, and features such as preemption and priority support in the completely fair class fair_sched_class make the design of CFS considerably more complex than simply picking the process with the least runtime. But don’t fret; let’s work through the principles of CFS step by step.

Minimum runtime

In order to avoid excessive preemption, the Linux kernel sets a minimum runtime (or scheduling granularity) for each task: once a process is on the CPU, it cannot be preempted until this much time has passed. Unless the process gives up the CPU voluntarily or executes a blocking system call, it can generally run for at least this long. The minimum runtime can be read from the kernel parameter sched_min_granularity_ns.

cat /proc/sys/kernel/sched_min_granularity_ns
3000000

Time Slices

CFS ensures that higher-priority processes get more CPU time by introducing weights: time slices are allocated among processes in proportion to their weights. The runtime allocated to a process within one scheduling cycle is calculated according to this formula.

runtime t_n = scheduling cycle T * process weight w / sum of the weights of all processes on the run queue S

The weight is a quantity derived from the nice value, nowadays defined in the Linux kernel in kernel/sched/core.c#L9516 with the following values:

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

The weight for a nice value of 0 is NICE_0_LOAD = 1024, and each step of 1 in the nice value changes the weight by a factor of roughly 1.25. The factor 1.25 comes from the design goal that a difference of 1 in nice value should result in about a 10% difference in CPU time.
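
To make the slice formula concrete, here is a minimal userspace sketch (the 20 ms cycle and the two-task run queue are made up for the example; the weights are taken from the table above):

#include <stdio.h>

int main(void)
{
    /* Weights for nice 0 and nice 1, from sched_prio_to_weight. */
    const int w0 = 1024;           /* NICE_0_LOAD */
    const int w1 = 820;
    const double period_ms = 20.0; /* hypothetical scheduling cycle */
    const int sum = w0 + w1;

    /* runtime = scheduling cycle * weight / sum of weights */
    printf("nice 0: %5.2f ms (%.1f%%)\n",
           period_ms * w0 / sum, 100.0 * w0 / sum);
    printf("nice 1: %5.2f ms (%.1f%%)\n",
           period_ms * w1 / sum, 100.0 * w1 / sum);
    return 0;
}

The two shares come out to roughly 55.5% and 44.5%, about 10 percentage points apart, which is the per-nice-step difference the table was designed around.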

Virtual Runtime

Now that we have solved the problem of how much time to allocate to each process, two questions remain: who runs next, and how can higher-priority processes be guaranteed more runtime when the CFS implementation is built on the principle of “complete fairness”? This is where the concept of “virtual runtime” comes in. Suppose we want the following scenario.

high-priority process running for 15 ms = low-priority process running for 5 ms

If we treat these two as the same amount of “virtual runtime”, then the high-priority process will end up running three times as long as the low-priority one. In Linux, virtual runtime is related to real runtime through the weights just mentioned.

virtual runtime = real runtime * NICE_0_LOAD / process weight
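
Expressed as code, the conversion looks roughly like the following sketch (modeled on the kernel’s calc_delta_fair() in kernel/sched/fair.c; the real implementation avoids the 64-bit division by using fixed-point shift-and-multiply tricks in __calc_delta()):

typedef unsigned long long u64;   /* kernel type, stubbed for the sketch */

#define NICE_0_LOAD 1024

/* Convert real runtime (delta_exec) into virtual runtime for a
 * process of the given weight. Higher weight => vruntime grows
 * more slowly, so the process gets picked to run more often. */
static u64 calc_delta_fair_sketch(u64 delta_exec, unsigned long weight)
{
    if (weight != NICE_0_LOAD)
        delta_exec = delta_exec * NICE_0_LOAD / weight;
    return delta_exec;
}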

Linux defines the vruntime variable in the scheduling entity (include/linux/sched.h#L453), which records a process’s cumulative virtual runtime. The code is as follows.

struct sched_entity {
    /* For load-balancing: */
    struct load_weight        load;
    struct rb_node            run_node;
    struct list_head        group_node;
    unsigned int            on_rq;

    u64                exec_start;
    u64                sum_exec_runtime;
    u64                vruntime;
    u64                prev_sum_exec_runtime;

    u64                nr_migrations;

    struct sched_statistics        statistics;
        ...
};

The vruntime variable is updated every time the process runs. As for how the process with the smallest vruntime is picked, that is the job of the red-black tree.
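
A stripped-down sketch of that update (loosely based on update_curr() in kernel/sched/fair.c, reusing the field names from the struct above and the conversion sketch from the previous section; the real function also updates statistics and min_vruntime):

static void update_curr_sketch(struct sched_entity *curr, u64 now)
{
    /* Time the entity has actually run since it was put on the CPU. */
    u64 delta_exec = now - curr->exec_start;

    curr->exec_start = now;
    curr->sum_exec_runtime += delta_exec;
    /* Weight the elapsed time into virtual runtime. */
    curr->vruntime += calc_delta_fair_sketch(delta_exec, curr->load.weight);
}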

Red-black Trees

A red-black tree is a self-balancing binary search tree in which the leftmost node always holds the smallest key. Rebalancing is performed as part of every insertion (or update) and deletion, so the tree is guaranteed to stay balanced between these operations.

See https://www.jianshu.com/p/e136ec79235c for details of the operations

The kernel arranges the processes on the run queue into a red-black tree keyed by vruntime, so picking the process with the smallest vruntime reduces to picking the leftmost node.
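
A sketch of both operations using the kernel’s rbtree API (modeled on __enqueue_entity() and __pick_first_entity() in kernel/sched/fair.c; simplified, e.g. recent kernels use rb_root_cached so the leftmost node is tracked without walking the tree):

#include <linux/types.h>
#include <linux/rbtree.h>
#include <linux/sched.h>

/* Insert a scheduling entity into the tree, ordered by vruntime. */
static void enqueue_entity_sketch(struct rb_root *root, struct sched_entity *se)
{
    struct rb_node **link = &root->rb_node;
    struct rb_node *parent = NULL;

    while (*link) {
        struct sched_entity *entry;

        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /* Signed comparison, as in the kernel's entity_before(),
         * so that vruntime wraparound is handled correctly. */
        if ((s64)(se->vruntime - entry->vruntime) < 0)
            link = &parent->rb_left;
        else
            link = &parent->rb_right;
    }
    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, root);  /* rebalance and recolor */
}

/* Picking the next process is then just the leftmost node. */
static struct sched_entity *pick_first_sketch(struct rb_root *root)
{
    struct rb_node *left = rb_first(root);

    return left ? rb_entry(left, struct sched_entity, run_node) : NULL;
}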

Minimum vruntime

In addition to the red-black tree, the kernel maintains a min_vruntime variable to track the minimum virtual runtime on the queue at any point in time.

Why is this min_vruntime needed?

  • When a new process is added to the red-black tree, what should its vruntime start at?
  • When a process that has been dormant for 100,000 years finally sees the event it was waiting for and is woken up, does it keep its original vruntime?
  • When a process is migrated to another CPU (another run queue) by load balancing, how should its vruntime change?

In all of these situations, min_vruntime gives the kernel a baseline against which a process’s vruntime can be normalized, so that no process gains an unfair advantage or falls hopelessly behind; a sketch follows below.
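
A rough sketch of how min_vruntime is used in each of the three cases (loosely based on place_entity() and the task-migration paths in kernel/sched/fair.c; the structures are stubbed, and details such as the slice-sized offset for new tasks and the exact wakeup bonus are omitted):

typedef unsigned long long u64;   /* kernel type, stubbed for the sketch */

struct cfs_rq_sk { u64 min_vruntime; };
struct se_sk     { u64 vruntime; };

/* 1. New task: start at the queue's minimum, so it neither gets an
 *    unfair head start nor has to burn down a huge vruntime backlog. */
static void place_new(struct cfs_rq_sk *rq, struct se_sk *se)
{
    se->vruntime = rq->min_vruntime;
}

/* 2. Woken task: keep its vruntime, but never let it lag more than a
 *    bounded bonus behind the queue's minimum, or it would hog the CPU. */
static void place_woken(struct cfs_rq_sk *rq, struct se_sk *se, u64 bonus)
{
    u64 floor = rq->min_vruntime - bonus;

    if (se->vruntime < floor)
        se->vruntime = floor;
}

/* 3. Migration: make vruntime relative to the old queue's minimum on
 *    dequeue, then rebase it onto the new queue's minimum on enqueue. */
static void migrate(struct cfs_rq_sk *from, struct cfs_rq_sk *to,
                    struct se_sk *se)
{
    se->vruntime -= from->min_vruntime;
    se->vruntime += to->min_vruntime;
}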