Since Go 1.12, people have been running into false positives in memory monitoring. The reason is that, starting with 1.12, Go changed the advice it passes to the madvise system call when releasing memory from MADV_DONTNEED to MADV_FREE. According to the available documentation, RSS, the most commonly used memory monitoring metric, does not drop for memory the process has released until the OS actually reclaims it. Naturally, some have suggested replacing RSS with a more appropriate metric, such as PSS or even USS. This raises some tricky questions, because PSS and USS are less common than RSS and the documentation says little about how they actually reflect memory consumption. Are they really more appropriate than RSS?

What are RSS, PSS, and USS?

To make the problem clear, we first need to define the terms. Searching for this question turns up a whole bunch of explanations copied from one another, along these lines:

VSS, USS, PSS, and RSS are four indicators for measuring memory usage:

- VSS: Virtual Set Size, virtual memory footprint, including shared libraries.
- RSS: Resident Set Size, actual physical memory usage, including shared libraries.
- PSS: Proportional Set Size, the actual physical memory used, with shared libraries and the like divided proportionally among the processes that share them.
- USS: Unique Set Size, the physical memory occupied exclusively by the process, not counting memory used by shared libraries.
Generally we have VSS >= RSS >= PSS >= USS.

From these descriptions, the overall impression is that USS is better than PSS, PSS is better than RSS, and VSS is basically unusable. VSS reflects the virtual address space that the process has requested and not yet returned; RSS counts shared libraries in full; PSS divides the size of each shared library among the processes that share it; and USS leaves shared library memory out entirely.
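
The proportional accounting described above can be made concrete with a small worked example. This is a minimal sketch, not kernel code: the `page` type, its `sharers` count, and the `account` function are hypothetical names introduced for illustration, and every page is assumed to have at least one sharer.

```go
package main

import "fmt"

// page describes one resident page of a process and how many processes
// currently map it (1 means the page is private to this process).
type page struct {
	sizeKB  uint64
	sharers uint64
}

// account returns RSS, PSS and USS in kB for a set of resident pages.
func account(pages []page) (rss, pss, uss uint64) {
	for _, p := range pages {
		rss += p.sizeKB             // RSS counts every resident page in full
		pss += p.sizeKB / p.sharers // PSS charges each sharer an equal part
		if p.sharers == 1 {
			uss += p.sizeKB // USS counts private pages only
		}
	}
	return rss, pss, uss
}

func main() {
	pages := []page{
		{sizeKB: 4, sharers: 1}, // private heap page
		{sizeKB: 4, sharers: 1}, // private stack page
		{sizeKB: 4, sharers: 4}, // library page shared by four processes
		{sizeKB: 4, sharers: 2}, // page shared with one other process
	}
	rss, pss, uss := account(pages)
	fmt.Printf("RSS=%d kB PSS=%d kB USS=%d kB\n", rss, pss, uss)
	// prints: RSS=16 kB PSS=11 kB USS=8 kB
}
```

With no shared pages at all, the three numbers coincide, which is the situation a statically linked binary tends to be in.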

By this definition, RSS, PSS, and USS differ only in how they treat shared libraries, but for statically linked programs like those written in Go, shared libraries are not that common. A reasonable suspicion is that, in most cases, RSS == PSS == USS.

MADV_DONTNEED vs MADV_FREE

Memory consumption is tracked by the kernel itself, so a good kernel naturally exposes this information somewhere for inspection. On Linux, for example, RSS is reported in /proc/[pid]/status, and a running application that wants to query its own consumption can read /proc/self/status directly, as cat does for itself here:

$ cat /proc/self/status
Name:   cat
...
Pid:    3509083
...
VmPeak:    11676 kB
VmSize:    11676 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:       596 kB
VmRSS:       596 kB
RssAnon:              68 kB
RssFile:             528 kB
RssShmem:              0 kB

The meaning of each field is described in the proc man page (man proc): for example, VmRSS is the RSS value and VmSize is the VSS value. The contents of /proc/[pid]/status are formatted for human readers, so when reading the values programmatically you can get them from the more compact /proc/[pid]/stat file instead. Let's take RSS as an example.

var pageSize = syscall.Getpagesize()

// rss returns the resident set size of the current process, unit in MiB
func rss() int {
    data, err := ioutil.ReadFile("/proc/self/stat")
    if err != nil {
        log.Fatal(err)
    }
    fs := strings.Fields(string(data))
    rss, err := strconv.ParseInt(fs[23], 10, 64)
    if err != nil {
        log.Fatal(err)
    }
    return int(uintptr(rss) * uintptr(pageSize) / (1 << 20)) // MiB
}
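
The rss() function above splits the whole stat line on whitespace, which assumes the process name (the second field, wrapped in parentheses) contains no spaces. A slightly more defensive variant might look like the following sketch. The `rssPagesFromStat` name and the sample line are made up for illustration; the field position (rss is field 24) follows the proc man page.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rssPagesFromStat extracts the rss field (field 24, counted in pages) from
// the contents of a /proc/[pid]/stat file. The comm field may contain
// spaces, so parsing starts after the last ')' instead of splitting the
// whole line.
func rssPagesFromStat(stat string) (int64, error) {
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, fmt.Errorf("malformed stat line: no ')' found")
	}
	// fields[0] is field 3 (state), so field 24 (rss) is fields[21].
	fields := strings.Fields(stat[end+1:])
	if len(fields) < 22 {
		return 0, fmt.Errorf("malformed stat line: %d fields after comm", len(fields))
	}
	return strconv.ParseInt(fields[21], 10, 64)
}

func main() {
	// Hypothetical stat line for a process named "my app" (note the space).
	sample := "42 (my app) S 1 42 42 0 -1 4194304 100 0 0 0 1 2 0 0 20 0 5 0 1000 11956224 149 18446744073709551615"
	pages, err := rssPagesFromStat(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(pages, "pages resident") // prints: 149 pages resident
}
```

Multiplying the result by the page size gives bytes, exactly as in rss().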

On Linux, memory obtained from mmap with PROT_READ and PROT_WRITE is not backed by physical pages right away: the first access to each page triggers a page fault, and only then does the OS actually allocate the page to the process. The difference between passing MADV_DONTNEED and MADV_FREE to madvise can be measured directly with the rss() function above. For example:

package main

import (
    "flag"
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "runtime"
    "strconv"
    "strings"
    "syscall"
)

/*
#include <sys/mman.h> // for C.MADV_FREE
*/
import "C"

func main() {
    useDontneed := flag.Bool("dontneed", false, "use MADV_DONTNEED instead of MADV_FREE")
    flag.Usage = func() {
        fmt.Fprintf(os.Stderr, "usage: %s [flags] anon-MiB\n", os.Args[0])
        flag.PrintDefaults()
        os.Exit(2)
    }
    flag.Parse()
    if flag.NArg() != 1 {
        flag.Usage()
    }
    anonMB, err := strconv.Atoi(flag.Arg(0))
    if err != nil {
        flag.Usage()
    }

    // anonymous mapping
    m, err := syscall.Mmap(-1, 0, anonMB<<20, syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_PRIVATE|syscall.MAP_ANON)
    if err != nil {
        log.Fatal(err)
    }
    printStats("After anon mmap:", m)

    // page fault by accessing it
    for i := 0; i < len(m); i += pageSize {
        m[i] = 42
    }
    printStats("After anon fault:", m)

    // use different strategy
    if *useDontneed {
        err = syscall.Madvise(m, syscall.MADV_DONTNEED)
        if err != nil {
            log.Fatal(err)
        }
        printStats("After MADV_DONTNEED:", m)
    } else {
        err = syscall.Madvise(m, C.MADV_FREE)
        if err != nil {
            log.Fatal(err)
        }
        printStats("After MADV_FREE:", m)
    }
    runtime.KeepAlive(m)
}

func printStats(ident string, m []byte) {
    fmt.Print(ident, " ", rss(), " MiB RSS\n")
}

With a 10 MiB request, you can see the following result:

$ go run main.go 10
After anon mmap: 2 MiB RSS
After anon fault: 13 MiB RSS
After MADV_FREE: 13 MiB RSS

$ go run main.go -dontneed 10
After anon mmap: 3 MiB RSS
After anon fault: 13 MiB RSS
After MADV_DONTNEED: 3 MiB RSS

The difference is clear: after madvise with MADV_FREE, RSS does not drop, whereas with MADV_DONTNEED the memory is returned in full.

PSS/USS vs RSS

So how do we get the PSS and USS values? More detailed memory mapping information is available in /proc/[pid]/smaps, but it is a bit tricky to compute from because it is reported separately for each mapping. That does not stop us from automating the process:

type mmapStat struct {
    Size           uint64
    RSS            uint64
    PSS            uint64
    PrivateClean   uint64
    PrivateDirty   uint64
    PrivateHugetlb uint64
}

func getMmaps() (*[]mmapStat, error) {
    var ret []mmapStat
    contents, err := ioutil.ReadFile("/proc/self/smaps")
    if err != nil {
        return nil, err
    }
    lines := strings.Split(string(contents), "\n")
    // function of parsing a block
    getBlock := func(block []string) (mmapStat, error) {
        m := mmapStat{}
        for _, line := range block {
            if strings.Contains(line, "VmFlags") ||
                strings.Contains(line, "Name") {
                continue
            }
            field := strings.Split(line, ":")
            if len(field) < 2 {
                continue
            }
            v := strings.Trim(field[1], " kB") // remove last "kB"
            t, err := strconv.ParseUint(v, 10, 64)
            if err != nil {
                return m, err
            }
            switch field[0] {
            case "Size":
                m.Size = t
            case "Rss":
                m.RSS = t
            case "Pss":
                m.PSS = t
            case "Private_Clean":
                m.PrivateClean = t
            case "Private_Dirty":
                m.PrivateDirty = t
            case "Private_Hugetlb":
                m.PrivateHugetlb = t
            }
        }
        return m, nil
    }
    blocks := make([]string, 0, 16) // length 0, so no empty block is parsed
    for _, line := range lines {
        // a line whose first field does not end in ":" is a mapping header
        if !strings.HasSuffix(strings.Split(line, " ")[0], ":") {
            if len(blocks) > 0 {
                g, err := getBlock(blocks)
                if err != nil {
                    return &ret, err
                }
                ret = append(ret, g)
            }
            blocks = make([]string, 0, 16)
        } else {
            blocks = append(blocks, line)
        }
    }
    return &ret, nil
}

type smapsStat struct {
    VSS uint64 // bytes
    RSS uint64 // bytes
    PSS uint64 // bytes
    USS uint64 // bytes
}

func getSmaps() (*smapsStat, error) {
    mmaps, err := getMmaps()
    if err != nil {
        return nil, err
    }
    smaps := &smapsStat{}
    for _, mmap := range *mmaps {
        smaps.VSS += mmap.Size * 1024
        smaps.RSS += mmap.RSS * 1024
        smaps.PSS += mmap.PSS * 1024
        smaps.USS += mmap.PrivateDirty*1024 + mmap.PrivateClean*1024 + mmap.PrivateHugetlb*1024
    }
    return smaps, nil
}

It can then be used as follows:

stat, err := getSmaps()
if err != nil {
    panic(err)
}
fmt.Printf("VSS: %d MiB, RSS: %d MiB, PSS: %d MiB, USS: %d MiB\n",
    stat.VSS/(1<<20), stat.RSS/(1<<20), stat.PSS/(1<<20), stat.USS/(1<<20))
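
To see the per-mapping summation in isolation, here is a minimal sketch that adds up a single field over smaps-formatted text. The `sumSmapsField` helper and the two-mapping `sample` excerpt are invented for illustration and are not part of the program above. Working on a string rather than on /proc/self/smaps keeps the example reproducible.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumSmapsField adds up one "Name: <n> kB" field across every mapping in
// smaps-formatted text and returns the total in kB.
func sumSmapsField(smaps, name string) (uint64, error) {
	var total uint64
	sc := bufio.NewScanner(strings.NewReader(smaps))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Value lines look like "Pss:  4 kB". Mapping headers and
		// VmFlags lines do not match this shape and are skipped.
		if len(fields) != 3 || fields[0] != name+":" || fields[2] != "kB" {
			continue
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, err
		}
		total += kb
	}
	return total, sc.Err()
}

// Abbreviated, made-up smaps excerpt with two mappings.
const sample = `00400000-00401000 r-xp 00000000 fd:01 1234 /usr/bin/demo
Size:                  4 kB
Rss:                   4 kB
Pss:                   2 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
VmFlags: rd ex mr mw me
7f0000000000-7f0000002000 rw-p 00000000 00:00 0
Size:                  8 kB
Rss:                   8 kB
Pss:                   8 kB
Private_Clean:         0 kB
Private_Dirty:         8 kB
VmFlags: rd wr mr mw me ac
`

func main() {
	for _, name := range []string{"Rss", "Pss", "Private_Dirty"} {
		kb, err := sumSmapsField(sample, name)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d kB\n", name, kb)
	}
	// prints:
	// Rss: 12 kB
	// Pss: 10 kB
	// Private_Dirty: 8 kB
}
```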

Applying it to the previous program gives the following output:

$ go run main.go 10 # MADV_FREE
After anon mmap: 2 MiB RSS
After anon fault: 13 MiB RSS
After MADV_FREE: 13 MiB RSS
VSS: 1048 MiB, RSS: 13 MiB, PSS: 12 MiB, USS: 12 MiB

$ go run main.go -dontneed 10
After anon mmap: 2 MiB RSS
After anon fault: 13 MiB RSS
After MADV_DONTNEED: 3 MiB RSS
VSS: 1049 MiB, RSS: 3 MiB, PSS: 2 MiB, USS: 2 MiB

As suspected, there is no difference: PSS and USS simply follow RSS. So what should we monitor instead? There are three options:

  1. Set GODEBUG=madvdontneed=1, for Go releases from 1.12 up to (but not including) 1.16.
  2. Call runtime.ReadMemStats periodically to read the runtime's own report, or use expvar or the standard pprof handler. Each of these carries a significant performance penalty for the runtime, because the queries require stopping the world (STW).
  3. Upgrade to Go 1.16, which goes back to MADV_DONTNEED by default on Linux.

Of course, there is a fourth way to do this: no monitoring.

If you know Linux system calls well, you might also think of using the mincore system call to check whether pages are resident. That is one way to do it, but not a practical one for Go, because user code does not know which addresses the runtime has obtained for the process, let alone which pages. Even if it did, the check would be very expensive. It is still possible to do this kind of check, but only for memory that you requested via mmap yourself:

/*
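#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdint.h>
// inCore returns the number of resident pages in [base, base+length),
// or -1 on failure with errno set.
static int inCore(void *base, uint64_t length, uint64_t pages) {
    int count = 0;
    unsigned char *vec = malloc(pages);
    if (vec == NULL)
        return -1;
    if (mincore(base, length, vec) < 0) {
        int saved = errno; // keep errno intact for the Go caller
        free(vec);         // do not leak the vector on failure
        errno = saved;
        return -1;
    }
    for (uint64_t i = 0; i < pages; i++)
        if (vec[i] & 1) // only the lowest bit is defined by mincore
            count++;
    free(vec);
    return count;
}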
*/
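import "C"

import "unsafe"

// inCore reports how much of b is resident in memory, in MiB.
// b must be page-aligned memory obtained from mmap.
func inCore(b []byte) int {
    if len(b) == 0 {
        return 0
    }
    pages := (len(b) + pageSize - 1) / pageSize // round up a partial last page
    n, err := C.inCore(unsafe.Pointer(&b[0]), C.uint64_t(len(b)), C.uint64_t(pages))
    if n < 0 {
        log.Fatal(err)
    }
    return int(uintptr(n) * uintptr(pageSize) / (1 << 20)) // MiB
}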
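
The bookkeeping around a mincore call (how long the result vector must be, and how to count resident pages in it) can also be written in pure Go, which makes it easy to check without cgo. The following is only a sketch: `vecLen` and `residentBytes` are hypothetical helper names, the 4096-byte page size in main is an assumption, and the vector contents are made up rather than returned by a real mincore call.

```go
package main

import "fmt"

// vecLen returns the number of entries mincore needs for a region of
// length bytes: one byte per page, rounding a partial last page up.
func vecLen(length, pageSize int) int {
	return (length + pageSize - 1) / pageSize
}

// residentBytes counts the pages whose lowest bit is set in a mincore
// result vector and converts the count to bytes. Only the least
// significant bit of each entry is defined, so the other bits are masked.
func residentBytes(vec []byte, pageSize int) int {
	n := 0
	for _, v := range vec {
		if v&1 != 0 {
			n++
		}
	}
	return n * pageSize
}

func main() {
	const pageSize = 4096 // assumed here; real code would use syscall.Getpagesize()

	vec := make([]byte, vecLen(10000, pageSize)) // 10000 bytes span 3 pages
	vec[0], vec[2] = 1, 0x81                     // made-up result: pages 0 and 2 resident
	fmt.Println(len(vec), residentBytes(vec, pageSize)) // prints: 3 8192
}
```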