mmap Basic Concepts

mmap is a method of memory-mapped file IO: it maps a file into a process's address space, establishing a correspondence between the file's on-disk address and a virtual address in the process. Once this mapping is in place, the process can read and write this region of memory through ordinary pointers, and the system automatically writes dirty pages back to the corresponding file on disk; that is, file IO is completed without calling read, write, or other system calls. Conversely, changes made to this region in kernel space are directly reflected in user space, which also allows file sharing between processes.

The operating system provides the following family of mmap-related functions:

void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t len);
int msync(void *addr, size_t len, int flags);
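A rough correspondence with the Java APIs discussed below (my own mapping, for orientation rather than exact equivalence):

// mmap   -> FileChannel#map, which returns a MappedByteBuffer
// msync  -> MappedByteBuffer#force, which flushes dirty pages to the device
// munmap -> releasing the mapping, covered later under "Reclaiming mmap memory"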

mmap in Java

Native file read/write facilities in Java can be roughly divided into three kinds: plain IO, FileChannel, and mmap. FileWriter and FileReader in the java.io package belong to plain IO; FileChannel, in the java.nio package, is the most common file manipulation class in Java; and today's protagonist, mmap, is a special way of reading and writing files, obtained through FileChannel's map method and known as memory mapping.

mmap is used like this:

FileChannel fileChannel = new RandomAccessFile(new File("db.data"), "rw").getChannel();
MappedByteBuffer mappedByteBuffer = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, fileChannel.size());

MappedByteBuffer is the mmap operation class in Java.

// Write
byte[] data = new byte[4];
int position = 8;
// Write 4 bytes at the current position of the mmap pointer
mappedByteBuffer.put(data);
// Write 4 bytes at the given position
MappedByteBuffer writeBuffer = mappedByteBuffer.slice();
writeBuffer.position(position);
writeBuffer.put(data);

// Read
// Read 4 bytes from the current position of the mmap pointer
mappedByteBuffer.get(data);
// Read 4 bytes at the given position
MappedByteBuffer readBuffer = mappedByteBuffer.slice();
readBuffer.position(position);
readBuffer.get(data);
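If you want the durability guarantee that msync provides in C, the Java counterpart is MappedByteBuffer#force; a minimal sketch (whether you call it depends on your durability requirements):

// Force the dirty pages of the mapped region out to the storage device,
// the Java counterpart of msync
mappedByteBuffer.force();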

mmap is not a silver bullet

A big motivation for writing this article came from the many misconceptions about mmap floating around the web. When I first learned about mmap, many articles said mmap was suited to handling large files; in hindsight that view is absurd, and I hope this article clarifies what mmap is actually for.

That FileChannel and mmap coexist suggests that each has its appropriate use cases, and indeed they do. Think of them as two tools for implementing file IO; neither tool is inherently better or worse.

mmap vs FileChannel

This section details the similarities and differences between FileChannel and mmap for file IO.

pageCache

Reads and writes through both FileChannel and mmap go through the pageCache, or more precisely the cache portion of memory as observed by vmstat, rather than user-space memory.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 4622324  40736 351384    0    0     0     0 2503  200 50  1 50  0  0

I have not researched whether the memory mapped by mmap can strictly be called pageCache, but from the OS's point of view there is little difference between the two, since both parts of the cache are controlled by the kernel. Later in this article we will also refer to the memory obtained from mmap as pageCache.

Page faults

Readers with a basic understanding of Linux file IO will be familiar with the concept of page faults. Both mmap and FileChannel read and write files by way of page faults.

Take mmap reading a 1G file as an example: fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, _GB). The mapping itself is an extremely cheap operation, but it does not mean the 1G file has been read into the pageCache. To bring the whole file into the pageCache, you can traverse it like this:

FileChannel fileChannel = new RandomAccessFile(file, "rw").getChannel();
MappedByteBuffer map = fileChannel.map(MapMode.READ_WRITE, 0, _GB);
int temp = 0;
for (int i = 0; i < _GB; i += _4kb) {
    // touch one byte per 4kb page to fault the whole file into the pageCache
    temp += map.get(i);
}

This is exactly how the MappedByteBuffer#load method works: load also triggers page faults by touching each page.
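In other words, the manual 4kb-step traversal above and load() achieve the same effect; a minimal sketch:

// load() touches every page, faulting the whole mapping into the pageCache,
// equivalent to the manual 4kb-step traversal above
map.load();
// isLoaded() gives a best-effort hint whether the mapping is resident in memory
boolean resident = map.isLoaded();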

The following vmstat output shows the pageCache growing step by step, by about 1.034G in total, indicating that the file contents have been fully loaded at that point.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 4824640   1056 207912    0    0     0     0 2374  195 50  0 50  0  0
 2  1      0 4605300   2676 411892    0    0 205256     0 3481 1759 52  2 34 12  0
 2  1      0 4432560   2676 584308    0    0 172032     0 2655  346 50  1 25 24  0
 2  1      0 4255080   2684 761104    0    0 176400     0 2754  380 50  1 19 29  0
 2  3      0 4086528   2688 929420    0    0 167940    40 2699  327 50  1 25 24  0
 2  2      0 3909232   2692 1106300    0    0 176520     4 2810  377 50  1 23 26  0
 2  2      0 3736432   2692 1278856    0    0 172172     0 2980  361 50  1 17 31  0
 3  0      0 3722064   2840 1292776    0    0 14036     0 2757  392 50  1 29 21  0
 2  0      0 3721784   2840 1292892    0    0   116     0 2621  283 50  1 50  0  0
 2  0      0 3721996   2840 1292892    0    0     0     0 2478  237 50  0 50  0  0

Two details:

  1. the mmap mapping itself can be understood as lazy loading; only get() triggers a page fault
  2. the read-ahead size is determined by the OS algorithm and can be treated as 4kb by default; that is, to turn lazy loading into eager loading, you need to traverse the mapping once with step = 4kb, as in the loop above

FileChannel page faults work on the same principle: both need the pageCache as a springboard to complete file reads and writes.
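For comparison, a minimal FileChannel read sketch; the data still flows SSD -> pageCache -> user buffer, with missing pages faulted in from disk first:

FileChannel fileChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer byteBuffer = ByteBuffer.allocateDirect(_4kb);
// read() copies from the pageCache into the user-space buffer;
// pages absent from the pageCache are first faulted in from disk
fileChannel.read(byteBuffer, 0);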

Number of memory copies

Many people argue that mmap performs one less copy than FileChannel, but I personally think we need to distinguish between scenarios.

For example, if the requirement is to read an int from the start of a file, the two paths are actually the same: SSD -> pageCache -> application memory; mmap does not save a copy.

But if the requirement is to maintain a 100M reusable buffer that is involved in file IO, mmap can be used directly as that 100M buffer, instead of maintaining another 100M buffer in process (user-space) memory.
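A sketch of that idea (file name and size are illustrative): the mapping itself serves as the buffer, so writes land in the pageCache without a second user-space copy.

// Map a 100M file and use the mapping itself as the reusable buffer
FileChannel fileChannel = new RandomAccessFile(new File("buffer.data"), "rw").getChannel();
MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, 100 * 1024 * 1024);
byte[] data = new byte[]{1, 2, 3, 4};
// This write goes straight into the pageCache; no extra user-space buffer needed
buffer.put(data);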

User mode vs. kernel mode

For security reasons, the operating system encapsulates some low-level capabilities and exposes them as system calls, which introduces the problem of switching between "user mode" and "kernel mode". I think this is where many people's concepts are blurred, so let me lay out my personal understanding here.

Let's look at FileChannel first. Of the two pieces of code below, which do you think is faster?

// Method 1: flush in 4kb blocks
FileChannel fileChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer byteBuffer = ByteBuffer.allocateDirect(_4kb);
for (int i = 0; i < _4kb; i++) {
    byteBuffer.put((byte)0);
}
for (int i = 0; i < _GB; i += _4kb) {
    byteBuffer.position(0);
    byteBuffer.limit(_4kb);
    fileChannel.write(byteBuffer);
}

// Method 2: flush one byte at a time
FileChannel fileChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer byteBuffer = ByteBuffer.allocateDirect(1);
byteBuffer.put((byte)0);
for (int i = 0; i < _GB; i ++) {
    byteBuffer.position(0);
    byteBuffer.limit(1);
    fileChannel.write(byteBuffer);
}

With method 1, 4kb buffered flushing (the conventional approach), writing 1G took only 1.2s on my test machine, while method 2, which uses no buffering at all, was practically stuck: the file grew very slowly, and after waiting 5 minutes without it finishing, I aborted the test.

Using a write buffer is a classic optimization trick: the user only needs to set up a write buffer that is a multiple of 4kb to aggregate small writes, so that data is flushed from the pageCache in multiples of 4kb as far as possible, avoiding write amplification. But that is not the focus of this section. Have you ever considered that the pageCache is itself a layer of buffering? Writing 1 byte does not synchronously hit the disk; it is equivalent to writing to memory, and the operating system decides when the pageCache is flushed. So why is method 2 so slow? The main reason is that FileChannel's underlying read/write system calls require switching between user mode and kernel mode. Method 2 switches 4096 times as often as method 1, and the mode switching becomes the bottleneck, causing the severe slowdown.

To summarize the key points so far, setting up a user-space write buffer in DRAM has two implications:

  1. it makes 4kb alignment easy, which is friendly to SSD flushing
  2. it reduces the number of user/kernel mode switches, which is friendly to the CPU

mmap is different: its underlying mapping capability means that reads and writes through it involve no switching between user mode and kernel mode. Note that this still has nothing to do with memory copying; the root cause is the mmap system call itself, after which reads and writes are plain memory accesses. This is easy to verify: let's implement method 2 with mmap and see how fast it is.

FileChannel fileChannel = new RandomAccessFile(file, "rw").getChannel();
MappedByteBuffer map = fileChannel.map(MapMode.READ_WRITE, 0, _GB);
for (int i = 0; i < _GB; i++) {
    map.put((byte) 0);
}

On my test machine this took 3s: slower than FileChannel with a 4kb write buffer, but far faster than FileChannel writing one byte at a time.

Additional mmap details

copy on write mode

Notice that the first parameter of public abstract MappedByteBuffer map(MapMode mode, long position, long size), MapMode, actually has three values, yet when surfing the web you can hardly find any article explaining it. The three enumerated values are READ_WRITE, READ_ONLY, and PRIVATE. Most of the time READ_WRITE is used, and READ_ONLY is just READ_WRITE minus writing, which is easy to understand; PRIVATE, however, seems to wear a veil of mystery.

In fact, PRIVATE mode is mmap's copy-on-write mode. When a file is mapped with MapMode.PRIVATE:

  1. modifications to the file made elsewhere are directly reflected in the current mmap mapping
  2. once the private mmap performs its own put, a copy of the affected page is made; subsequent changes are never flushed to the file, and the mapping no longer sees changes to that page of the file

Commonly known as: copy on write.
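A minimal sketch of PRIVATE mode (file name illustrative); the write triggers a copy and never reaches the file:

FileChannel fileChannel = new RandomAccessFile(new File("db.data"), "rw").getChannel();
MappedByteBuffer privateMap = fileChannel.map(FileChannel.MapMode.PRIVATE, 0, fileChannel.size());
// This put triggers copy on write: the touched page is copied, and the change
// stays in the private copy -- db.data itself is never modified
privateMap.put(0, (byte) 1);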

What is the point of this? The point is precisely that changes are never flushed back to the file. First, you get a copy of the file; if you happen to need one, you can map it directly in PRIVATE mode. Second, and this is a little exciting, you get a piece of real pageCache without worrying about the overhead of the OS flushing it to disk. If, say, the remaining 1G of memory can effectively only be used by the kernel, and you want it available to a user-mode program, you can use mmap's copy-on-write mode, which occupies neither your in-heap nor your off-heap memory.

Reclaiming mmap memory

To correct an error in a previous blog post about reclaiming mmap memory: reclaiming an mmap mapping is simple.

((DirectBuffer) mmap).cleaner().clean();

The life of an mmap mapping can be roughly divided into: map (mapping), get/load (page faults), and clean (reclamation). A useful trick is to map regions dynamically and reclaim them asynchronously once they have been read.
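Putting the whole lifecycle together, a sketch (the cast to sun.nio.ch.DirectBuffer relies on JDK internals and on newer JDKs may need --add-exports/--add-opens flags):

FileChannel fileChannel = new RandomAccessFile(file, "rw").getChannel();
MappedByteBuffer mmap = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, _GB); // map
mmap.load();                              // fault the pages in (page faults)
// ... read and write through the mapping ...
((DirectBuffer) mmap).cleaner().clean(); // reclaim the mapping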

mmap usage scenarios

Using mmap to handle frequent reads and writes of small data

If IO is very frequent but each piece of data is very small, mmap is recommended, as it avoids the mode-switch overhead that FileChannel incurs. A typical example is appending writes to an index file.
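A sketch of such an append-only index (file name and entry layout are illustrative): each append is a plain memory write, with no system call on the hot path.

FileChannel indexChannel = new RandomAccessFile(new File("index.data"), "rw").getChannel();
MappedByteBuffer index = indexChannel.map(FileChannel.MapMode.READ_WRITE, 0, _GB);
// Append one (offset, size) entry: sequential puts advance the buffer position,
// and each put is a pure memory write -- no mode switch involved
index.putLong(4096L); // illustrative message offset
index.putInt(128);    // illustrative message size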

mmap caching

When using FileChannel for file reads and writes, a write cache is often needed for aggregation, usually placed in in-heap or off-heap memory. Both share a problem: if the process dies, the in-heap/off-heap memory is lost immediately, along with any data that has not reached the disk. Using mmap as the cache instead writes directly into the pageCache, so a process crash does not lose the data. Note that this only guards against the process being killed, not against power failure.

Reading and writing small files

Contrary to many claims on the web, mmap is particularly well suited to sequential reads and writes because it avoids mode switches. However, because of the size limit of FileChannel#map(MapMode mode, long position, long size), a single mapping cannot exceed Integer.MAX_VALUE bytes (about 2G). So if we take 2G as the threshold between large and small files, mmap is generally advantageous for reading and writing files smaller than 2G. RocketMQ exploits exactly this: the commitLog is sliced into 1G segments to make mmap convenient to use. By the way, RocketMQ and other message queues use mmap throughout.
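A sketch of that RocketMQ-style slicing (segment size and file naming are illustrative): cut the logical log into fixed-size files and map each one separately, so every individual mapping stays under the 2G limit.

long _1GB = 1024L * 1024 * 1024;
long logicalOffset = 5L * 1024 * 1024 * 1024; // illustrative: byte 5G of the logical log
// Locate the segment file that holds this logical offset and map just that segment
int segment = (int) (logicalOffset / _1GB);
FileChannel segmentChannel = new RandomAccessFile(new File("commitlog-" + segment), "rw").getChannel();
MappedByteBuffer segmentMap = segmentChannel.map(FileChannel.MapMode.READ_WRITE, 0, _1GB);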

Reads and writes when CPU is scarce

In most scenarios, FileChannel combined with a read/write buffer beats or ties mmap, but in CPU-starved reads and writes, mmap often comes out ahead, because mmap does not burn CPU on user/kernel mode switches (though it takes on the overhead of dynamic mapping and asynchronous memory reclamation instead).

Special hardware and software factors

For example, persistent memory (PMem), different generations of SSDs, different CPUs, different numbers of cores, different file systems, and different file system mount options all affect the read/write speed of mmap and FileChannel, since the two go through different system calls. Only benchmarking will tell which is faster.
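A crude timing harness for such comparisons (a sketch; for serious numbers use a proper benchmarking tool such as JMH):

long start = System.nanoTime();
for (int i = 0; i < _GB; i += _4kb) {
    // ... the mmap or FileChannel read/write under test ...
}
long elapsedMs = (System.nanoTime() - start) / 1_000_000;
System.out.println("elapsed: " + elapsedMs + " ms");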