Virtual Memory

The first thing we need to have a clear idea of what we normally call memory is that what we are talking about is actually virtual memory, not manipulation of real physical memory. On a bare metal machine without an operating system installed, the memory we are dealing with is physical memory. As computers evolved, a program called an “operating system” was installed on the computer, which took over all the hardware resources and provided us with a layer of abstraction. We, as software developers, only need to engage with the abstraction layer of the operating system, which simplifies many of the problems caused by hardware inconsistencies.

Virtual memory is the abstraction layer that the operating system provides to us for the actual physical memory. Most modern operating systems are multitasking systems, and each of these “tasks”, generally called processes, has its own separate process address space. In other words, each process looks at virtual memory as if it owns all the space, i.e. as if it were the only process on the entire system. If it wants to communicate with other processes, it can do so using sockets, for example. This brings us to the cleverness of the design of operating systems and computer networks. Sockets are designed as a way for two processes to communicate, whether they are two processes on the same machine or two processes on two machines in a network, using this same abstract concept to communicate, shielding a lot of concrete details.

Compilation models for C and C++

Here we focus on the C case. C++ follows the C model without major differences.

For the C compiler, it can compile a separate .c file independently, which is produced as an .o file. Finally we link them into an executable file by means of a linker.

And in the middle of this arises the question, if each .c file is compiled separately, how does it know the information of other files? Let’s say a function in file A needs to call another function in file B.

To give a concrete example.

1
2
3
4
// foo.c
int foo(int i) {
    return i + 42;
}
1
2
3
4
// bar.c
int bar(int i) {
    return foo(i) + 24;
}

Here a function in bar.c references a function in foo.c, so how do we tell the compiler about the function foo? The final design is to pretend we have this function .

1
2
3
4
5
6
// bar.c
int foo(int i); // Declare the function foo, telling the compiler that we have it.

int bar(int i) {
    return foo(i) + 24;
}

And as more of this information is declared, we can put all of this information into a separate file and include it in, forming the form of a header file.

1
2
3
4
5
// bar.c
#include "foo.h"
int bar(int i) {
    return foo(i) + 24;
}

ELF file format

The source file is compiled by the compiler to form an object file, which is a binary file with only 0’s and 1’s. Different operating systems have different formats for object files, and only the ELF format used on Linux will be discussed below.

The Object file we compile from a .c file needs to have at least two parts of information in it.

  • Code, which represents the binary sequence of instructions.
  • Data, which is a binary sequence representing global data.

By design, we put information with the same attributes together in what is called a segment. And since the data in ELF files are binary, there is no way to distinguish them directly. To solve this, we separate out another area called the segment table, which records the mapping of each segment’s name to the area it occupies. In order to find the segment table, we use a fixed-length area at the beginning of an ELF file as the header, where each offset represents a fixed meaning. Using this, we can record the start position of the segment table in the header, so we can get all the information about the segments.

Here are some of the more important segments.

segment name role
.text code
.data Initialized global and local static variables
.bss Uninitialized global and local static variables
.rodata Read-only data, such as string constants and global const variables

We can view the information with the objdump and readelf tools. As an example.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
// demo.c
int foo();

int global1 = 42;
int global2;
const int global3 = 43;
int main() {
    static int static1 = 43;
    static int static2;
    return 12;
}

Compile with the following command.

1
gcc -c demo.c

View file header.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
readelf -h demo.o

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          640 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         13
  Section header string table index: 12

View information about each segment.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
objdump -h demo.o

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000000f  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000008  0000000000000000  0000000000000000  00000050  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000008  0000000000000000  0000000000000000  00000058  2**2
                  ALLOC
  3 .rodata       00000004  0000000000000000  0000000000000000  00000058  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 .comment      0000002c  0000000000000000  0000000000000000  0000005c  2**0
                  CONTENTS, READONLY
  5 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00000088  2**0
                  CONTENTS, READONLY
  6 .note.gnu.property 00000020  0000000000000000  0000000000000000  00000088  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 .eh_frame     00000038  0000000000000000  0000000000000000  000000a8  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

View the code segment.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
objdump -s -d demo.o

Contents of section .text:
 0000 f30f1efa 554889e5 b80c0000 005dc3    ....UH.......].
Contents of section .data:
 0000 2a000000 2b000000                    *...+...
Contents of section .rodata:
 0000 2b000000                             +...
Contents of section .comment:
 0000 00474343 3a202855 62756e74 75203131  .GCC: (Ubuntu 11
 0010 2e332e30 2d317562 756e7475 317e3232  .3.0-1ubuntu1~22
 0020 2e303429 2031312e 332e3000           .04) 11.3.0.
Contents of section .note.gnu.property:
 0000 04000000 10000000 05000000 474e5500  ............GNU.
 0010 020000c0 04000000 03000000 00000000  ................
Contents of section .eh_frame:
 0000 14000000 00000000 017a5200 01781001  .........zR..x..
 0010 1b0c0708 90010000 1c000000 1c000000  ................
 0020 00000000 0f000000 00450e10 8602430d  .........E....C.
 0030 06460c07 08000000                    .F......

Disassembly of section .text:

0000000000000000 <main>:
   0:   f3 0f 1e fa             endbr64
   4:   55                      push   %rbp
   5:   48 89 e5                mov    %rsp,%rbp
   8:   b8 0c 00 00 00          mov    $0xc,%eax ; 0xc is 12, i.e. returns 12
   d:   5d                      pop    %rbp
   e:   c3                      ret

The essence of linking is to put together a single Object file. If the variables and functions in each Object file were independent, this would be a very simple process. However, in practice it is often the case that this file references a function in another file, and we must solve this problem. If a function in foo.c calls an external function bar(), the jmp instruction we generate must point to the correct address.

So one of the main responsibilities of the connector is to determine the actual address of each symbol. For this purpose, each ELF file maintains a symbol table that represents a mapping of all symbols’ names and its related information (addresses, types, etc.).

As an example.

1
2
3
int i = 42;
int foo(int);
void bar() { foo(i); }

We check it for all symbols.

1
2
3
4
nm demo.o
0000000000000000 T bar
                 U foo
0000000000000000 D i

T represents symbols defined in the code segment, D represents symbols defined in the data segment, and U represents undefined symbols that need to be processed by the linker later.

C++ Name Mangling

Since function overloading exists in C++, multiple functions can be renamed. If this is not handled, then symbolic conflicts can occur. For this, the C++ compiler adds the function arguments to the symbolic name as well. Different ABI’s have different rules. For the Itanium ABI used in Linux, the above example is compiled as C++ code with the following symbols.

1
2
3
4
5
nm demo.o

0000000000000000 D i
0000000000000000 T _Z3barv
                 U _Z3fooi

The detailed rules can be found in the Itanium ABI documentation, where it is sufficient to know that name mangling exists. C++ also provides a way to link C++ code without mangle symbols in order to make C code link correctly.

1
2
3
extern "C" {
    int foo();
}

Weak symbols and strong symbols

The types of symbols are classified as strong symbol and weak symbol. If there are two symbols with the same name, take the definition of strong symbol, multiple strong symbols with the same name will conflict and report an error, multiple weak symbols with the same name is undefined behavior.

The compiler defaults functions and initialized global variables to strong symbols, and uninitialized global variables to weak symbols; GCC also provides an extension to define any symbol as a weak symbol.

1
__attribute__((weak)) weak = 42;

Weak symbols are useful for libraries. Weak symbols defined by libraries can be overridden by strong symbols defined by the user, allowing programs to use custom library functions.

For example, the definition of operator new in libcxx.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
// Implement all new and delete operators as weak definitions
// in this shared library, so that they can be overridden by programs
// that define non-weak copies of the functions.

_LIBCPP_WEAK
void *
operator new(std::size_t size) _THROW_BAD_ALLOC
{
    ...
}

Users can provide their own operator new functions, overriding the self-contained library functions.

A static library can be seen as a collection of Object files. A static library can be created with the following command.

1
2
3
gcc -c foo.c
gcc -c bar.c
ar -rc libtest.a foo.o bar.o

In fact, ar itself is a tool for packaging multiple files together, nothing unusual.

The details of linking static libraries with other object files into executable files belong to the internal implementation of the connector, so we won’t discuss it too much here, but the general idea is to merge all the .o files referenced in the static library (.a) together, which is the reason why the statically linked program is so large.

Global objects in C++

We usually think of C and C++ programs as starting with main, but before that we need to initialize the process environment, call the global object constructor, and many other things in order for the program to run properly. ELF also defines two special segments for constructing and destructing global objects.

  • .init holds the initialization code for the process, and the constructor calls for the global object.
  • .fini holds the termination code of the process, and the destructor calls of the global object.

Dynamic linking

The idea of static linking is simple and effective, but its biggest problem is that it is too large. For this reason, dynamic linking was created.

The basic idea of dynamic linking is that the composition of the link is pushed back from before the original program is loaded to when it is loaded.

Location-independent code

One of the main goals of shared libraries is to allow multiple processes to use the same library code, thus saving hard disk and memory. A simple idea is to pre-allocate a dedicated address space and always load the shared library into this location. However, this is not a good use of the address space and also tends to cause a lot of memory fragmentation. To solve these problems, modern operating systems use a technique called location-independent code.

Data References

The compiler creates a table at the beginning of the data segment called the Global Offset Table. This table is an array, and each table entry is the address of the referenced global data target (function or global variable). And because the distance between a data segment and a code segment is always a constant, we can find the corresponding table entry by using the instruction pointer (%rip) and an offset.

As an example.

1
2
3
foo:
    mov 0x2008b9(%rip), %rax ; Find the corresponding table entry in GOT, note the address.
    addl $0x1, (%rax) ; Because it is an address, it has to be dereferenced first and then manipulated.

The equivalent C code is as follows.

1
2
%rax = GOT[3]; // GOT[3] What is stored is the actual address of the global variable.
*rax += 1;

code reference

Suppose a program calls a function defined by a shared library. The compiler cannot predict the actual address of this function because it can be loaded to an arbitrary location. To solve this problem, we use a technique called Lazy Binding, which defers the binding of the address until the first call.

Lazy Binding is implemented through a combination of a Global Offset Table and a Procedurev Linkage Table. The Procedure Linkage Table PLT is also an array that acts like a proxy whose table entries are responsible for calling the actual called function, and the GOT, as mentioned above, holds the actual address of the called function (note that it needs to be resolved at runtime).

As an example.

  1. When a user program first calls a function foo that is defined in a shared library, it enters the corresponding PLT table entry. This PLT table entry is a jmp instruction that jumps to the address defined in the corresponding GOT table entry.
  2. Since this is the first call, the GOT table entry is not the real address of the function foo, but the next address of the previous jmp instruction. The instruction at this address is a push instruction, representing the ID of the function foo, followed by another jmp instruction, this time jumping to PLT[0], which is the call to the dynamic connector.
  3. The dynamic connector finds the actual location of the function foo and rewrites its corresponding GOT table entry.
  4. From then on, all subsequent calls will jump to the corresponding location after step 1.

Symbol Visibility

According to the original linking model, all symbols in a shared library should be externally publicly accessible. But there are obviously many problems with this design, the loss of good encapsulation is only one, but most importantly, when the number of symbols is very large, as in some C++ libraries that abuse templates (yes I’m talking about boost), it can cause a significant increase in loading time.

The compiler provides a command line argument for this purpose, which makes all symbols invisible externally by default.

1
-fvisibility=hidden

Then we can mark the symbols we need to export with one of the GCC provided extensions.

1
2
3
4
void __attribute__((visibility("default"))) Exported()
{
    // ...
}

This reduces the size of the dynamic library (by reducing the symbol table) and reduces symbol collisions.