Studying the performance of a program is important for every engineer; I would even say it is a necessary skill.

Golang’s performance analysis tool, pprof, supports the following kinds of profiles (example commands for fetching each one follow the list).

  • heap: a sampling of the live objects currently in memory (almost a must-have for a language with GC), used to analyze memory problems.
  • profile: in contrast to heap, a sampling of CPU usage, used to find the program’s time-consuming bottlenecks.
  • goroutine: stack traces of all current goroutines.
  • allocs: a sampling of all past memory allocations.
  • threadcreate: stack traces that led to the creation of new OS threads.
  • block: stack traces that led to blocking on synchronization primitives.
  • mutex: stack traces of holders of contended mutexes.
  • trace: an execution trace of the current program.
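
For reference, each of these is exposed as an HTTP endpoint once the net/http/pprof handler shown later in this post is registered (localhost:8080 is the address used in that example; the block and mutex profiles also need the corresponding runtime rate to be enabled):

go tool pprof http://localhost:8080/debug/pprof/heap
go tool pprof http://localhost:8080/debug/pprof/profile       # CPU, samples for 30s by default
go tool pprof http://localhost:8080/debug/pprof/goroutine
go tool pprof http://localhost:8080/debug/pprof/allocs
go tool pprof http://localhost:8080/debug/pprof/threadcreate
go tool pprof http://localhost:8080/debug/pprof/block         # needs runtime.SetBlockProfileRate
go tool pprof http://localhost:8080/debug/pprof/mutex         # needs runtime.SetMutexProfileFraction
curl -o trace.out 'http://localhost:8080/debug/pprof/trace?seconds=5' && go tool trace trace.out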

The debugging experience with good tools is also great: the whole process can be done with a few simple commands. However, due to limited space, and because CPU analysis has been covered many times before, today we will only talk about how to analyze memory, i.e. the heap.

Heap profiling is generally used to find memory leaks, or to optimize memory usage.

Memory leaks and memory optimization

Memory leaks are often difficult to find and analyze, because you usually discover them by monitoring the Go program itself or by spotting OOM kills in Linux’s dmesg output.

dmesg | grep oom-killer

When you find an OOM log, you should consider adding the monitoring you have been missing, because the problem will happen again (although it is often hard to reproduce on your own machine or in a pre-release environment). If you don’t know what to monitor, look at the existing metrics and simply pick a percentage: for example, if your program takes up 10% of the machine’s memory right after initialization, then alert as soon as its usage reaches some higher percentage, say 50% of the machine’s memory, so that you can start analyzing immediately.
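
The check itself can be as simple as comparing runtime memory statistics against the machine’s memory. This is not from the original post, just a minimal sketch of the idea; the machineMemBytes constant, the 30-second interval, and the log-based “alert” are all assumptions to be replaced by your real monitoring:

import (
	"log"
	"runtime"
	"time"
)

// Assumption: a 16 GiB machine; in practice read this from the system or your config.
const machineMemBytes = 16 << 30

// watchMemory logs an alert whenever the Go heap in use crosses the given
// fraction of machine memory (e.g. 0.5 for 50%).
func watchMemory(threshold float64) {
	for range time.Tick(30 * time.Second) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		ratio := float64(ms.HeapInuse) / float64(machineMemBytes)
		if ratio >= threshold {
			// Replace this with your real alerting channel (Prometheus, on-call, etc.).
			log.Printf("memory alert: heap in use is %.1f%% of machine memory", ratio*100)
		}
	}
}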

However, there is no need to go to a lot of trouble, because you can add pprof support to your Go program with a few simple lines of code that do not affect the program’s normal operation, and it exposes a web service so that you can analyze it through your browser.

import (
  "net/http"
  _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
  go func() {
    // Serve pprof on a separate goroutine so it does not affect the main program.
    http.ListenAndServe("localhost:8080", nil)
  }()

  // ... the rest of your program ...
}

Then, using the pprof tool that ships with Go, you can start analyzing, for example, a memory leak.

go tool pprof http://localhost:8080/debug/pprof/heap

It will take you to pprof’s REPL, where you can locate the problem with some simple commands. However, for a better analysis experience, there are two things to note.

  1. If your build flags include -trimpath and -ldflags "-s -w", it is better to remove them, otherwise they will make it harder to locate the problem (see the build example after this list).
  2. Run the pprof command on the machine where the binary was compiled, so that the analysis can drill all the way down to individual lines of code.
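
For illustration, the difference is just the flags passed to go build (these exact commands are my example, not from the original post):

go build -trimpath -ldflags "-s -w" -o app .   # release build: symbols and file paths stripped, harder for pprof
go build -o app .                              # analysis-friendly build: keeps symbols and full source paths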

The actual example below is a memory-usage optimization rather than an OOM; since I have not run into an OOM myself yet, I will start with a small case I did encounter, as the two kinds of problems are analyzed in the same way.

How to use pprof

The first step is to look at top10.

(pprof) top10
Showing nodes accounting for 3759.91kB, 100% of 3759.91kB total
Showing top 5 nodes out of 24
      flat  flat%   sum%        cum   cum%
 2345.25kB 62.38% 62.38%  2345.25kB 62.38%  io.ReadAll
  902.59kB 24.01% 86.38%   902.59kB 24.01%  compress/flate.NewWriter
         0     0%   100%   902.59kB 24.01%  bufio.(*Writer).Flush
         0     0%   100%   902.59kB 24.01%  compress/gzip.(*Writer).Write
(hereinafter omitted)...

Here’s a hint: flat is the space allocated by the function itself (the rightmost call in the stack), while cum (cumulative) is the space allocated by the function plus everything it calls. top sorts by flat by default, and you can switch the sort order with an argument: top10 -cum.

If you don’t see anything unusual here, there are other views worth checking, because Golang’s heap sampling statistics are divided into four kinds.

  • alloc_objects: total objects allocated, including ones already freed
  • alloc_space: total space allocated, including space already freed
  • inuse_objects: objects currently in use
  • inuse_space: space currently in use (the default)

You can switch between them with a command like sample_index=inuse_objects.
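
For example, inside the REPL, or directly from the command line when fetching the profile (the address is the one from the earlier setup):

(pprof) sample_index=alloc_space
(pprof) top10

go tool pprof -sample_index=alloc_space http://localhost:8080/debug/pprof/heap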

In my case, I am sure why the first item, io.ReadAll, appears in my program, but the second item, compress/flate.NewWriter, looks like an anomaly, and I don’t know where it is called from. So, after identifying the suspicious item, the second step is to confirm the call chain with tree.

(pprof) tree 10 compress
Active filters:
   focus=compress
Showing nodes accounting for 2354.01kB, 29.36% of 8018.09kB total
Showing top 10 nodes out of 11
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                         2354.01kB   100% |   compress/gzip.(*Writer).Write
 1805.17kB 22.51% 22.51%  2354.01kB 29.36%                | compress/flate.NewWriter
                                          548.84kB 23.32% |   compress/flate.(*compressor).init
----------------------------------------------------------+-------------
                                          548.84kB   100% |   compress/flate.(*compressor).init (inline)
  548.84kB  6.85% 29.36%   548.84kB  6.85%                | compress/flate.(*compressor).initDeflate
----------------------------------------------------------+-------------
                                         2354.01kB   100% |   github.com/prometheus/common/expfmt.MetricFamilyToText.func1
         0     0% 29.36%  2354.01kB 29.36%                | bufio.(*Writer).Flush
                                         2354.01kB   100% |   compress/gzip.(*Writer).Write
----------------------------------------------------------+-------------
                                          548.84kB   100% |   compress/flate.NewWriter
         0     0% 29.36%   548.84kB  6.85%                | compress/flate.(*compressor).init
                                          548.84kB   100% |   compress/flate.(*compressor).initDeflate (inline)
----------------------------------------------------------+-------------
                                         2354.01kB   100% |   bufio.(*Writer).Flush
         0     0% 29.36%  2354.01kB 29.36%                | compress/gzip.(*Writer).Write
                                         2354.01kB   100% |   compress/flate.NewWriter
----------------------------------------------------------+-------------
                                         2354.01kB   100% |   github.com/prometheus/common/expfmt.NewEncoder.func7
         0     0% 29.36%  2354.01kB 29.36%                | github.com/prometheus/common/expfmt.MetricFamilyToText
                                         2354.01kB   100% |   github.com/prometheus/common/expfmt.MetricFamilyToText.func1
----------------------------------------------------------+-------------
                                         2354.01kB   100% |   github.com/prometheus/common/expfmt.MetricFamilyToText
         0     0% 29.36%  2354.01kB 29.36%                | github.com/prometheus/common/expfmt.MetricFamilyToText.func1
                                         2354.01kB   100% |   bufio.(*Writer).Flush
----------------------------------------------------------+-------------
                                         2354.01kB   100% |   github.com/prometheus/common/expfmt.encoderCloser.Encode
         0     0% 29.36%  2354.01kB 29.36%                | github.com/prometheus/common/expfmt.NewEncoder.func7
                                         2354.01kB   100% |   github.com/prometheus/common/expfmt.MetricFamilyToText
----------------------------------------------------------+-------------
                                         2354.01kB   100% |   xizhibei-app/controllers/internal_rpc.(*SystemCtrl).GetMetrics
         0     0% 29.36%  2354.01kB 29.36%                | github.com/prometheus/common/expfmt.encoderCloser.Encode
                                         2354.01kB   100% |   github.com/prometheus/common/expfmt.NewEncoder.func7
----------------------------------------------------------+-------------
         0     0% 29.36%  2354.01kB 29.36%                | xizhibei-app/controllers/internal_rpc.(*SystemCtrl).GetMetrics
                                         2354.01kB   100% |   github.com/prometheus/common/expfmt.encoderCloser.Encode
----------------------------------------------------------+-------------

Now we can basically confirm that there is a small problem in my GetMetrics implementation, in the part that handles serialization and compression for the Prometheus client (though not to the point of a memory leak). Alternatively, you can add a third step here: use the list command with a keyword (for example, list GetMetrics) to drill the analysis down to individual lines of code.

After locating the problem, the last step is to fix it, and my solution is sync.Pool. Previously, I called gzip.NewWriter directly to compress the metrics text every time it was fetched from Prometheus, which made gzip allocate and initialize memory over and over again, so I modified the code to use sync.Pool instead.

The original code:

buf := new(bytes.Buffer)
gzipWritter := gzip.NewWriter(buf)

The modified code:

var (
	gzipPool = sync.Pool{
		New: func() interface{} {
			// Create the writer with a nil destination; it is re-targeted
			// with Reset before every use.
			return gzip.NewWriter(nil)
		},
	}
	bufferPool = sync.Pool{
		New: func() interface{} {
			return new(bytes.Buffer)
		},
	}
)

...

gzipWritter := gzipPool.Get().(*gzip.Writer)
defer gzipPool.Put(gzipWritter)

buf := bufferPool.Get().(*bytes.Buffer)
defer bufferPool.Put(buf)

// Reset both before reuse: the buffer may still hold old data, and the
// gzip writer must be pointed at the freshly obtained buffer.
buf.Reset()
gzipWritter.Reset(buf)
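
For context, here is a minimal sketch of how the pooled objects might be used end to end; the compressMetrics name, the error handling, and the final copy are my assumptions rather than the original code, and the pools are the ones declared above:

func compressMetrics(metricsText []byte) ([]byte, error) {
	gzipWritter := gzipPool.Get().(*gzip.Writer)
	defer gzipPool.Put(gzipWritter)

	buf := bufferPool.Get().(*bytes.Buffer)
	defer bufferPool.Put(buf)

	buf.Reset()
	gzipWritter.Reset(buf)

	if _, err := gzipWritter.Write(metricsText); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data into buf.
	if err := gzipWritter.Close(); err != nil {
		return nil, err
	}
	// Copy the result out, because buf goes back to the pool when we return.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}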

We can write a benchmark to verify the improvement.
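
The benchmark code itself is not shown in the original post; a minimal sketch, assuming the hypothetical compressMetrics helper above and a synthetic metrics payload, might look like this (run it with go test -bench=. -benchmem to get the B/op and allocs/op columns shown below):

func BenchmarkEncodeWithSyncPool(b *testing.B) {
	// Synthetic payload standing in for the real Prometheus metrics text.
	payload := bytes.Repeat([]byte("# TYPE http_requests_total counter\n"), 1000)

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := compressMetrics(payload); err != nil {
			b.Fatal(err)
		}
	}
}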

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz
BenchmarkEncode-20                          2422            504022 ns/op          851822 B/op        129 allocs/op
BenchmarkEncodeWithSyncPool-20              7654            150188 ns/op           48799 B/op        108 allocs/op

You can see that the number of allocations dropped from 129 to 108 per operation, and the memory allocated per operation dropped from roughly 852 KB to roughly 49 KB.

Well, that’s it for now.

For most people, it is simpler to analyze a problem by clicking around in a web page, and go tool pprof now provides exactly that one-stop experience: you can view the call graph as well as the flame graph directly in the browser. For more details, see the official documentation.
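
For example, passing -http to go tool pprof starts that web UI locally (the port here is arbitrary):

go tool pprof -http=:8081 http://localhost:8080/debug/pprof/heap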