Anyone who knows a little about Linux knows that it divides physical Random Access Memory (RAM) into pages, typically 4KB each. The Swapping mechanism we introduce today is closely tied to memory: it is the process by which the operating system copies the contents of physical memory pages to swap space on the hard disk in order to free up memory. Physical memory and the swap space on disk together make up the virtual memory available to the operating system, and the swap space is configured in advance by the system administrator.

All processes on Linux deal with physical memory indirectly, through a layer of abstraction called virtual memory. Swapping takes full advantage of this by giving applications the illusion that the operating system has plenty of memory. An application does not realize that some of the virtual memory it uses actually lives on disk, where reads and writes are far slower than in memory.

Randomly reading 4KB of data from an SSD takes roughly 1,500 times longer than accessing main memory, and a seek on a mechanical disk takes roughly 100,000 times longer.
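
To see why even a small swap-in rate hurts, the ratios above can be plugged into the usual weighted-average estimate of access time. This is only a back-of-the-envelope sketch: the 100 ns main-memory latency is an assumed round figure rather than a number from this article, and the function name is hypothetical.

```c
#include <assert.h>

/* Assumed baseline: about 100 ns per main-memory access (illustrative). */
#define MEM_NS    100.0
/* Slowdown ratios quoted in the text. */
#define SSD_RATIO 1500.0
#define HDD_RATIO 100000.0

/*
 * Average cost of one memory access when a fraction `swap_fraction`
 * of accesses must first be read back from swap:
 *
 *   (1 - p) * t_mem + p * (t_mem * ratio)
 */
double effective_access_ns(double swap_fraction, double ratio)
{
    assert(swap_fraction >= 0.0 && swap_fraction <= 1.0);
    return (1.0 - swap_fraction) * MEM_NS
         + swap_fraction * MEM_NS * ratio;
}
```

Under these assumptions, if one access in a thousand has to come from swap, effective_access_ns(0.001, SSD_RATIO) gives about 250 ns instead of 100 ns, and effective_access_ns(0.001, HDD_RATIO) gives about 10,100 ns.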

With such a large performance gap, any process that triggers Swapping can suffer a performance loss. Repeatedly swapping the same page in and out causes very noticeable performance jitter, and without the right background knowledge the root cause can be hard to find. One example is MySQL frequently swapping memory pages in and out when NUMA is misconfigured, which degrades quality of service.

Linux provides two ways to enable Swapping: a swap partition and a swap file.

  • A swap partition is a separate area on the hard disk that is used only for swapping; no other files can be stored on it. We can use the swapon -s command to see the swap areas on the current system.
  • A swap file is a special file in the file system; apart from its purpose, it is not much different from other files in the file system.
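
The summary printed by swapon -s is the same information the kernel exposes in /proc/swaps, so a program can read it directly. The following is a minimal sketch of parsing that file; the struct and function names are hypothetical, and it assumes the usual five-column layout (name, type, size, used, priority) with sizes in KiB.

```c
#include <assert.h>
#include <stdio.h>

/* One data row of /proc/swaps; sizes are reported in KiB. */
struct swap_entry {
    char name[256];
    char type[16];          /* "partition" or "file" */
    unsigned long size_kb;
    unsigned long used_kb;
    int priority;
};

/*
 * Parse one row. Returns 0 on success and -1 otherwise.
 * The header row is rejected because its third column is not a number.
 */
int parse_swaps_line(const char *line, struct swap_entry *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, "%255s %15s %lu %lu %d",
                   out->name, out->type,
                   &out->size_kb, &out->used_kb, &out->priority);
    return n == 5 ? 0 : -1;
}

/* Sum the sizes of all swap areas listed in an open /proc/swaps stream. */
unsigned long total_swap_kb(FILE *fp)
{
    char line[512];
    struct swap_entry e;
    unsigned long total = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        if (parse_swaps_line(line, &e) == 0)
            total += e.size_kb;
    }
    return total;
}
```

A caller would open /proc/swaps with fopen, pass the stream to total_swap_kb, and close it afterwards. The type column is what distinguishes a swap partition from a swap file.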

The size of the swap space has to be set manually by the system administrator, and the right size depends on the scenario. On a desktop system, the swap partition can be twice the size of system memory, which lets us run more applications at the same time. On a server, swap should either be turned off or kept small, and once it is enabled, monitoring should be put in place to track application performance.

Now that we have a basic understanding of Swapping on Linux, we return to the question this article sets out to discuss, “Why Linux needs Swapping”. We will look at the two problems Swapping solves, along with what triggers it and the path it executes in each case.

  • Swapping can swap a process’s relatively little-used pages out of memory and immediately hand the freed memory to the process that is executing.
  • Swapping can swap a process’s idle pages out of memory, keeping memory ready for future use by other processes.

Out of memory

When the system needs more memory than is physically available, the kernel swaps infrequently used memory pages to disk so that the current process can get memory and keep running. This forced reclaim is called Direct Page Reclaim.

Direct page reclaim is triggered when Linux calls __alloc_pages_nodemask to request a new memory page. That function first looks in the free list. If no page is available, it falls back to __alloc_pages_slowpath, which, rather than just taking a page from the free list, allocates memory through the following steps.

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                       struct alloc_context *ac)
{
    ...
    /* Step 1: wake kswapd for background reclaim, then retry the free list. */
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);

    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;

    /* Step 2: costly requests try compaction before anything else. */
    if (can_direct_reclaim &&
        (costly_order || (order > 0 && ac->migratetype != MIGRATE_MOVABLE)) &&
        !gfp_pfmemalloc_allowed(gfp_mask)) {
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                            INIT_COMPACT_PRIORITY, &compact_result);
        if (page)
            goto got_pg;
        ...
    }

retry:
    /* Steps 3-5, heavily truncated: the checks on page between calls are omitted. */
    page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac, &did_some_progress);
    page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac, compact_priority, &compact_result);
    page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
got_pg:
    return page;
}

  1. Wake up the kswapd thread to reclaim memory in the background, then call get_page_from_freelist to try to get a memory page quickly.
  2. For costly allocation requests, first call __alloc_pages_direct_compact to compact memory; it in turn calls get_page_from_freelist (https://elixir.bootlin.com/linux/v5.9.6/source/mm/page_alloc.c#L3729) to look for free pages in the compacted memory.
  3. Call __alloc_pages_direct_reclaim to reclaim memory directly and allocate new pages.
  4. Call __alloc_pages_direct_compact again to compact memory and look for free pages.
  5. Call __alloc_pages_may_oom to allocate memory; if the allocation still fails, this invokes the out-of-memory (OOM) killer, which picks a process on the system and kills it to free memory.

Although the steps above are heavily truncated, they show the common ways Linux obtains memory when it is running low: memory compaction, direct reclaim, and, as a last resort, triggering the OOM killer to kill a process.
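
The fallback order described above can be sketched as a small user-space toy model. This is not kernel code: the toy_zone struct, its four counters, and toy_alloc_page are hypothetical names invented for illustration, and the model collapses every stage to "free up one page, then take it". It follows the order of the retry section: free list, direct reclaim, compaction, OOM killer.

```c
#include <assert.h>
#include <stddef.h>

/* Which stage finally satisfied the request. */
enum alloc_path {
    FROM_FREELIST,
    FROM_RECLAIM,
    FROM_COMPACTION,
    FROM_OOM_KILL,
    ALLOC_FAILED
};

/* Hypothetical, heavily simplified view of memory state. */
struct toy_zone {
    unsigned long free_pages;        /* usable right now */
    unsigned long reclaimable_pages; /* cold pages that can go to swap */
    unsigned long fragmented_pages;  /* usable only after compaction */
    unsigned long killable_pages;    /* held by a process the OOM killer may pick */
};

static int take_from_freelist(struct toy_zone *z)
{
    if (z->free_pages == 0)
        return 0;
    z->free_pages--;
    return 1;
}

/* Allocate one page, falling back stage by stage. */
enum alloc_path toy_alloc_page(struct toy_zone *z)
{
    assert(z != NULL);

    if (take_from_freelist(z))
        return FROM_FREELIST;

    /* Direct reclaim: push one cold page out to swap and reuse its frame. */
    if (z->reclaimable_pages > 0) {
        z->reclaimable_pages--;
        z->free_pages++;
        take_from_freelist(z);
        return FROM_RECLAIM;
    }

    /* Compaction: make one fragmented page usable. */
    if (z->fragmented_pages > 0) {
        z->fragmented_pages--;
        z->free_pages++;
        take_from_freelist(z);
        return FROM_COMPACTION;
    }

    /* Last resort: kill a process and take back all of its pages. */
    if (z->killable_pages > 0) {
        z->free_pages += z->killable_pages;
        z->killable_pages = 0;
        take_from_freelist(z);
        return FROM_OOM_KILL;
    }

    return ALLOC_FAILED;
}
```

Calling toy_alloc_page repeatedly on a zone with a few pages in each counter walks through the stages in order, which mirrors why swap acts as a buffer in front of the OOM killer: as long as reclaimable pages remain, the last stage is never reached.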

Memory Idle

Much of the memory an application uses during startup is often never touched again once startup is complete. A daemon running in the background can swap this used-once memory to disk to make room for other memory requests. kswapd is the Linux daemon for page replacement, and it is the main process responsible for swapping out idle memory. It starts reclaiming memory when the number of free pages falls below a certain level, so that other processes in the system can get the memory they request as quickly as possible, as shown in the following figure.

When the number of free pages drops below WMARK_LOW, kswapd starts working and swaps memory pages to disk until the number of free pages climbs back to WMARK_HIGH. If the number of free pages falls below WMARK_MIN, the direct page reclaim described in the previous section is triggered. A level above WMARK_HIGH means there is enough free memory and no reclaim is needed.

Linux uses a Least Recently Used (LRU) algorithm to replace pages in memory. Each zone in the system keeps an active_list and an inactive_list: the former holds active memory pages and the latter holds pages that are candidates for reclaim. In addition, Linux divides lru_list into the following categories based on the characteristics of the memory pages.
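
The three watermarks amount to a small decision rule with hysteresis. The sketch below restates the paragraph above in code; the struct and function names are hypothetical, and the watermark values a caller passes in are its own, not values read from the kernel.

```c
#include <assert.h>
#include <stddef.h>

enum reclaim_action {
    NO_RECLAIM,      /* enough free memory for this allocation */
    WAKE_KSWAPD,     /* below WMARK_LOW: reclaim in the background */
    DIRECT_RECLAIM   /* below WMARK_MIN: the allocating process reclaims itself */
};

/* Free-page thresholds, with min <= low <= high. */
struct watermarks {
    unsigned long min;
    unsigned long low;
    unsigned long high;
};

/* What an allocation triggers, given the current number of free pages. */
enum reclaim_action on_allocation(unsigned long free_pages,
                                  const struct watermarks *wm)
{
    assert(wm != NULL && wm->min <= wm->low && wm->low <= wm->high);
    if (free_pages < wm->min)
        return DIRECT_RECLAIM;
    if (free_pages < wm->low)
        return WAKE_KSWAPD;
    return NO_RECLAIM;
}

/* Once awake, kswapd keeps reclaiming until free pages reach WMARK_HIGH. */
int kswapd_should_continue(unsigned long free_pages,
                           const struct watermarks *wm)
{
    assert(wm != NULL);
    return free_pages < wm->high;
}
```

The gap between the two functions is the point of the design: kswapd is woken only below WMARK_LOW but does not stop until WMARK_HIGH, so it does not flip on and off around a single threshold.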

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};

Entries containing ANON are anonymous memory pages, which hold content unrelated to files, such as process stacks. Entries containing FILE are file-backed memory, that is, memory that corresponds to program files or data. The final entry, LRU_UNEVICTABLE, marks memory pages that must not be reclaimed.

Whenever a memory page is accessed, Linux moves it to the head of its list, so the ‘oldest’ page ends up at the tail of the active list. The kswapd daemon balances the lengths of the two lists by moving pages from the tail of the active list to the head of the inactive list, where they wait to be reclaimed. The function shrink_zones is responsible for reclaiming the inactive pages in the LRU lists.
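
The interplay of the two lists can be illustrated with a small user-space toy. It models only what the paragraph above describes (access moves a page to the head of the active list, balancing demotes from the active tail, reclaim takes from the inactive tail) and is not how the kernel implements it: pages are plain integers, the lists are fixed-size arrays, and every name here is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define LIST_CAP 8  /* toy capacity; pushing onto a full list trips an assert */

/* Index 0 is the head (newest); index count-1 is the tail (oldest). */
struct page_list {
    int pages[LIST_CAP];
    int count;
};

struct toy_lru {
    struct page_list active;
    struct page_list inactive;
};

static void push_head(struct page_list *l, int page)
{
    assert(l->count < LIST_CAP);
    memmove(&l->pages[1], &l->pages[0],
            (size_t)l->count * sizeof l->pages[0]);
    l->pages[0] = page;
    l->count++;
}

static int pop_tail(struct page_list *l)
{
    assert(l->count > 0);
    return l->pages[--l->count];
}

/* Remove `page` if present; returns 1 when it was found. */
static int remove_page(struct page_list *l, int page)
{
    for (int i = 0; i < l->count; i++) {
        if (l->pages[i] == page) {
            memmove(&l->pages[i], &l->pages[i + 1],
                    (size_t)(l->count - i - 1) * sizeof l->pages[0]);
            l->count--;
            return 1;
        }
    }
    return 0;
}

/* An access moves the page to the head of the active list. */
void lru_touch(struct toy_lru *lru, int page)
{
    if (!remove_page(&lru->active, page))
        remove_page(&lru->inactive, page);
    push_head(&lru->active, page);
}

/* Balance: demote the oldest active pages until at most max_active remain. */
void lru_balance(struct toy_lru *lru, int max_active)
{
    while (lru->active.count > max_active)
        push_head(&lru->inactive, pop_tail(&lru->active));
}

/* Reclaim the oldest inactive page; returns -1 when none is left. */
int lru_reclaim_one(struct toy_lru *lru)
{
    if (lru->inactive.count == 0)
        return -1;
    return pop_tail(&lru->inactive);
}
```

In this model a page that is touched again before it reaches the inactive tail is rescued back to the active list, which is why pages used only during startup are the ones that end up being swapped out.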

Summary

Many people think that when the system runs out of memory it should immediately trigger Out of memory (OOM) handling and kill a process. Swapping gives system administrators an alternative: using swap space on disk keeps programs from being killed outright, trading reduced quality of service for partial availability of the service. Swapping in Linux exists for two common situations, low memory and idle memory:

  • Swapping can swap a process’s relatively little-used pages out of memory: when the system needs more memory than is physically available, the kernel swaps infrequently used memory pages to disk to free memory for the current process, keeping the executing process available.
  • Swapping can swap a process’s idle pages out of memory: much of the memory an application uses during startup is often not used afterwards, and a daemon running in the background can swap this used-once memory to disk to reserve space for other memory requests.

There is a lot of debate about whether Swapping should be enabled or disabled, and we should not make a blanket statement either way. For example, since Kubernetes requires Swapping to be disabled, we should follow the community’s recommendation and turn it off on machines that run Kubernetes. To conclude, here are some more open-ended related questions that interested readers can think through.

  • What parameters does Linux provide to control the behavior of Swapping?
  • In which scenarios is it worth accepting reduced quality of service in exchange for partial availability?