Robust, readable, and efficient code is a common goal for all developers. This article draws on the features of the Go language to offer advice on common data structures, memory management, and concurrency so you can write more efficient code. Without further ado, let’s dig into the techniques of high-performance Go programming together.

1. Common Data Structures

1.1 Don’t abuse reflection

The standard library reflect provides the Go language with the ability to dynamically obtain the types and values of objects and dynamically create objects at runtime. Reflection can help abstract and simplify code, making development more efficient.

The Go language’s reflection capabilities are used in the Go standard library and in many open source software, such as json for serialization and deserialization, the ORM framework gorm, xorm, and so on.

1.1.1 Prefer strconv to fmt

For conversions between basic data types and strings, strconv is preferred over fmt because the former has better performance.

// Bad
for i := 0; i < b.N; i++ {
 s := fmt.Sprint(rand.Int())
}

BenchmarkFmtSprint-4    143 ns/op    2 allocs/op

// Good
for i := 0; i < b.N; i++ {
 s := strconv.Itoa(rand.Int())
}

BenchmarkStrconv-4    64.2 ns/op    1 allocs/op

The reason for the more-than-twofold performance difference is that fmt uses reflection to achieve a generics-like effect, performing dynamic type determination at runtime, which incurs a certain performance cost.

1.1.2 A little repetition is not worse than reflection

Sometimes we need utility functions, for example one that filters specified elements out of a []uint64 slice.

Using reflection, we could implement a slice-filtering function that generalizes over the element type.

// DeleteSliceElms filters the specified elements from the slice. Note: the original slice is not modified.
func DeleteSliceElms(i interface{}, elms ...interface{}) interface{} {
 // Build a set of the elements to delete.
 m := make(map[interface{}]struct{}, len(elms))
 for _, v := range elms {
  m[v] = struct{}{}
 }
 // Creates a new slice, filtering out the specified elements.
 v := reflect.ValueOf(i)
 t := reflect.MakeSlice(reflect.TypeOf(i), 0, v.Len())
 for i := 0; i < v.Len(); i++ {
  if _, ok := m[v.Index(i).Interface()]; !ok {
   t = reflect.Append(t, v.Index(i))
  }
 }
 return t.Interface()
}

In many cases, we only need to manipulate slices of a single type, so the type generalization that reflection provides simply goes unused. If we really need to filter slices of a type other than uint64, what is the harm in copying the code once? In most scenarios we will never filter slices of every possible type, so we pay the performance cost of reflection without ever enjoying its benefits.

// DeleteU64liceElms filters the specified elements from a []uint64. Note: the original slice is not modified.
func DeleteU64liceElms(i []uint64, elms ...uint64) []uint64 {
 // Build a set of the elements to delete.
 m := make(map[uint64]struct{}, len(elms))
 for _, v := range elms {
  m[v] = struct{}{}
 }
 // Creates a new slice, filtering out the specified elements.
 t := make([]uint64, 0, len(i))
 for _, v := range i {
  if _, ok := m[v]; !ok {
   t = append(t, v)
  }
 }
 return t
}

Here’s a look at the performance comparison between the two.

func BenchmarkDeleteSliceElms(b *testing.B) {
 slice := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9}
 elms := []interface{}{uint64(1), uint64(3), uint64(5), uint64(7), uint64(9)}
 for i := 0; i < b.N; i++ {
  _ = DeleteSliceElms(slice, elms...)
 }
}

func BenchmarkDeleteU64liceElms(b *testing.B) {
 slice := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9}
 elms := []uint64{1, 3, 5, 7, 9}
 for i := 0; i < b.N; i++ {
  _ = DeleteU64liceElms(slice, elms...)
 }
}

The benchmark test results are shown below.

go test -bench=. -benchmem main/reflect 
goos: darwin
goarch: amd64
pkg: main/reflect
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkDeleteSliceElms-12              1226868               978.2 ns/op           296 B/op         16 allocs/op
BenchmarkDeleteU64liceElms-12            8249469               145.3 ns/op            80 B/op          1 allocs/op
PASS
ok      main/reflect    3.809s

As you can see, reflection involves extra type checks and a lot of memory allocation, which has a very noticeable performance impact. As the number of slice elements grows, every check of whether an element is in the map causes the element to escape to the heap, because the map key is of the indeterminate type interface{}, and that triggers heap allocations. Predictably, the performance gap widens as the number of elements increases.

When using reflection, ask yourself, do I really need it?

1.1.3 Using binary.Read and binary.Write with caution

binary.Read and binary.Write use reflection and are very slow. If performance is a requirement, we should implement the conversion manually instead of calling them directly.

The encoding/binary package implements a simple conversion between numbers and byte sequences as well as encoding and decoding of varints.

Varints are a method of encoding integers with a variable number of bytes: the smaller the value, the fewer bytes it takes. This is how Protocol Buffers encodes integers.
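
To get a feel for varint encoding, here is a minimal sketch using binary.PutUvarint (the concrete values are just examples, not from the original article): a small value occupies a single byte, while a larger one needs more.

package main

import (
 "encoding/binary"
 "fmt"
)

func main() {
 buf := make([]byte, binary.MaxVarintLen64)
 fmt.Println(binary.PutUvarint(buf, 1))     // 1: a small value fits in one byte
 fmt.Println(binary.PutUvarint(buf, 1<<40)) // 6: a larger value needs more bytes
}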

The conversion between numbers and byte sequences can be done with the following three functions.

// Read reads structured binary data from r into data. data must be a pointer to a fixed-size value or a slice of fixed-size values.
func Read(r io.Reader, order ByteOrder, data interface{}) error
// Write writes the binary representation of data into w. data must be a fixed-size value or a slice of fixed-size values, or a pointer to such data.
func Write(w io.Writer, order ByteOrder, data interface{}) error
// Size returns how many bytes Write would generate to encode the value v.
func Size(v interface{}) int

Let’s take the familiar C function ntohl() as an example and see how it can be implemented in Go using the binary package.

// Ntohl converts a uint32 from network byte order to host byte order.
func Ntohl(bys []byte) (uint32, error) {
 r := bytes.NewReader(bys)
 var num uint32
 err := binary.Read(r, binary.BigEndian, &num)
 return num, err
}

// e.g. parse the IP 127.0.0.1 in network byte order into a uint32
fmt.Println(Ntohl([]byte{0x7f, 0, 0, 0x1})) // 2130706433 <nil>

What if we manually implement an ntohl() for the uint32 type?

func NtohlNotUseBinary(bys []byte) uint32 {
 return uint32(bys[3]) | uint32(bys[2])<<8 | uint32(bys[1])<<16 | uint32(bys[0])<<24
}

// e.g. parse the IP 127.0.0.1 in network byte order into a uint32
fmt.Println(NtohlNotUseBinary([]byte{0x7f, 0, 0, 0x1})) // 2130706433

This function is, in fact, modeled on how the encoding/binary package itself converts a big-endian byte sequence into a uint32.

Let’s look at the performance difference between the two, i.e. with and without reflection.

func BenchmarkNtohl(b *testing.B) {
 for i := 0; i < b.N; i++ {
  _, _ = Ntohl([]byte{0x7f, 0, 0, 0x1})
 }
}

func BenchmarkNtohlNotUseBinary(b *testing.B) {
 for i := 0; i < b.N; i++ {
  _ = NtohlNotUseBinary([]byte{0x7f, 0, 0, 0x1})
 }
}

Run the benchmark test above with the following results.

go test -bench=BenchmarkNtohl.* -benchmem main/reflect
goos: darwin
goarch: amd64
pkg: main/reflect
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkNtohl-12                       13026195                81.96 ns/op           60 B/op          4 allocs/op
BenchmarkNtohlNotUseBinary-12           1000000000               0.2511 ns/op          0 B/op          0 allocs/op
PASS
ok      main/reflect    1.841s

As you can see, the reflection-based encoding/binary implementation is far slower than the hand-written, type-specific version.

1.2 Avoid repetitive string to byte slice conversions

Do not repeatedly create a byte slice from a fixed string, as repeated slice initialization introduces performance loss. Instead, perform the conversion once and capture the result.

// Bad
for i := 0; i < b.N; i++ {
 w.Write([]byte("Hello world"))
}

BenchmarkBad-4   50000000   22.2 ns/op

// Good
data := []byte("Hello world")
for i := 0; i < b.N; i++ {
 w.Write(data)
}

BenchmarkGood-4  500000000   3.25 ns/op

1.3 Specifying the container capacity

Specify the container capacity whenever possible in order to pre-allocate memory for the container. This will reduce the need to resize the container by copying when elements are subsequently added.

1.3.1 Specifying map capacity hints

Whenever possible, provide capacity information when initializing with make().

make(map[T1]T2, hint)

Providing a capacity hint to make() allows the map to be sized appropriately at initialization time, which reduces the need to reallocate memory as elements are added to the map.

Note that, unlike slices, the map capacity hint does not guarantee complete pre-allocation; it is used to estimate the number of hashmap buckets needed. Therefore, allocations may still occur when adding elements to the map, even when a capacity is specified.

// Bad
m := make(map[string]os.FileInfo)

files, _ := ioutil.ReadDir("./files")
for _, f := range files {
    m[f.Name()] = f
}
// m is created without a size hint; it may require more allocations at runtime.

// Good
files, _ := ioutil.ReadDir("./files")

m := make(map[string]os.FileInfo, len(files))
for _, f := range files {
    m[f.Name()] = f
}
// m is created with a size hint; it may require fewer allocations at runtime.

1.3.2 Specifying slice capacity

Whenever possible, provide capacity information when initializing slices with make(), especially when appending to the slice afterwards.

make([]T, length, capacity)

Unlike map, the slice capacity is not a mere hint: the runtime allocates enough memory for the capacity passed to make(), which means that subsequent append() operations incur zero allocations (until the length of the slice reaches the capacity; after that, any append may reallocate to hold additional elements).

const size = 1000000

// Bad
for n := 0; n < b.N; n++ {
 data := make([]int, 0)
 for k := 0; k < size; k++ {
  data = append(data, k)
 }
}

BenchmarkBad-4    219    5202179 ns/op

// Good
for n := 0; n < b.N; n++ {
 data := make([]int, 0, size)
 for k := 0; k < size; k++ {
  data = append(data, k)
 }
}

BenchmarkGood-4   706    1528934 ns/op


1.4 Choosing a string concatenation method

The two methods most commonly used for in-line string concatenation, because they are quick and easy to write, are the following.

  • operator +
  • fmt.Sprintf()

The main goal of in-line string concatenation is to keep the code simple and readable. fmt.Sprintf() is very convenient: it accepts arguments of different types and formats them into the result string. However, its underlying implementation uses reflection, so it carries some performance cost.

The + operator simply concatenates strings; variables of non-string types must be converted explicitly first. In-line concatenation with + allocates no memory and involves no dynamic type handling, so it outperforms fmt.Sprintf().
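
For instance, here is a minimal sketch (the variable names are ours, not from the article): joining a string with an int using + requires an explicit strconv conversion, while fmt.Sprintf() formats it for you at the cost described above.

id := 42
s1 := "user-" + strconv.Itoa(id) // explicit conversion, no reflection
s2 := fmt.Sprintf("user-%d", id) // convenient, but goes through fmt's formatting machinery
_, _ = s1, s2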

For both performance and readability, if the variables to be concatenated do not require type conversion and there are few of them (<= 5), the + operator is recommended for in-line concatenation; otherwise, fmt.Sprintf() is recommended.

Here’s a performance comparison between the two.

// Good
func BenchmarkJoinStrWithOperator(b *testing.B) {
 s1, s2, s3 := "foo", "bar", "baz"
 for i := 0; i < b.N; i++ {
  _ = s1 + s2 + s3
 }
}

// Bad
func BenchmarkJoinStrWithSprintf(b *testing.B) {
 s1, s2, s3 := "foo", "bar", "baz"
 for i := 0; i < b.N; i++ {
  _ = fmt.Sprintf("%s%s%s", s1, s2, s3)
 }
}

The results of performing benchmark tests are shown below.

go test -bench=^BenchmarkJoinStr -benchmem .
BenchmarkJoinStrWithOperator-8    70638928    17.53 ns/op     0 B/op    0 allocs/op
BenchmarkJoinStrWithSprintf-8      7520017    157.2 ns/op    64 B/op    4 allocs/op

There are other ways to concatenate strings, such as strings.Join(), strings.Builder, bytes.Buffer and []byte, which are not suited to in-line use. Consider them when the number of strings to concatenate is large.

Let’s first look at the comparison of their performance tests.

func BenchmarkJoinStrWithStringsJoin(b *testing.B) {
 s1, s2, s3 := "foo", "bar", "baz"
 for i := 0; i < b.N; i++ {
  _ = strings.Join([]string{s1, s2, s3}, "")
 }
}

func BenchmarkJoinStrWithStringsBuilder(b *testing.B) {
 s1, s2, s3 := "foo", "bar", "baz"
 for i := 0; i < b.N; i++ {
  var builder strings.Builder
  _, _ = builder.WriteString(s1)
  _, _ = builder.WriteString(s2)
  _, _ = builder.WriteString(s3)
 }
}

func BenchmarkJoinStrWithBytesBuffer(b *testing.B) {
 s1, s2, s3 := "foo", "bar", "baz"
 for i := 0; i < b.N; i++ {
  var buffer bytes.Buffer
  _, _ = buffer.WriteString(s1)
  _, _ = buffer.WriteString(s2)
  _, _ = buffer.WriteString(s3)
 }
}

func BenchmarkJoinStrWithByteSlice(b *testing.B) {
 s1, s2, s3 := "foo", "bar", "baz"
 for i := 0; i < b.N; i++ {
  var bys []byte
   bys = append(bys, s1...)
   bys = append(bys, s2...)
   _ = append(bys, s3...)
 }
}

func BenchmarkJoinStrWithByteSlicePreAlloc(b *testing.B) {
 s1, s2, s3 := "foo", "bar", "baz"
 for i := 0; i < b.N; i++ {
   bys := make([]byte, 0, 9)
   bys = append(bys, s1...)
   bys = append(bys, s2...)
   _ = append(bys, s3...)
 }
}

The benchmark test results are shown below.

go test -bench=^BenchmarkJoinStr .
goos: windows
goarch: amd64
pkg: main/perf
cpu: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
BenchmarkJoinStrWithStringsJoin-8               31543916                36.39 ns/op
BenchmarkJoinStrWithStringsBuilder-8            30079785                40.60 ns/op
BenchmarkJoinStrWithBytesBuffer-8               31663521                39.58 ns/op
BenchmarkJoinStrWithByteSlice-8                 30748495                37.34 ns/op
BenchmarkJoinStrWithByteSlicePreAlloc-8         665341896               1.813 ns/op

From the results, we can see that strings.Join(), strings.Builder, bytes.Buffer and []byte perform similarly. If the length of the resulting string is predictable, a []byte with pre-allocated capacity gives the best performance.

Therefore, if the performance requirements are very strict, or the number of strings to concatenate is large, a []byte with pre-allocated capacity is recommended.

For a good balance of ease of use and performance, strings.Builder is generally recommended for string concatenation.

strings.Builder also provides a Grow method to pre-allocate memory.

func BenchmarkJoinStrWithStringsBuilderPreAlloc(b *testing.B) {
 s1, s2, s3 := "foo", "bar", "baz"
 for i := 0; i < b.N; i++ {
  var builder strings.Builder
  builder.Grow(9)
  _, _ = builder.WriteString(s1)
  _, _ = builder.WriteString(s2)
  _, _ = builder.WriteString(s3)
 }
}

The benchmark result for the version optimized with Grow is shown below. You can see that performance is much better than without pre-allocating the space.

BenchmarkJoinStrWithStringsBuilderPreAlloc-8    60079003                20.95 ns/op

1.5 Use index instead of range to iterate over []struct

There are two ways to iterate over slices or arrays in Go: by index and by range. There is no functional difference between the two, but is there a performance difference?

1.5.1 []int

Let’s first look at the performance difference when traversing slices of basic types, using []int as an example.

// genRandomIntSlice generates a random []int slice of the specified length
func genRandomIntSlice(n int) []int {
 rand.Seed(time.Now().UnixNano())
 nums := make([]int, 0, n)
 for i := 0; i < n; i++ {
  nums = append(nums, rand.Int())
 }
 return nums
}

func BenchmarkIndexIntSlice(b *testing.B) {
 nums := genRandomIntSlice(1024)
 for i := 0; i < b.N; i++ {
  var tmp int
  for k := 0; k < len(nums); k++ {
   tmp = nums[k]
  }
  _ = tmp
 }
}

func BenchmarkRangeIntSlice(b *testing.B) {
 nums := genRandomIntSlice(1024)
 for i := 0; i < b.N; i++ {
  var tmp int
  for _, num := range nums {
   tmp = num
  }
  _ = tmp
 }
}

The results of the running tests are as follows.

go test -bench=IntSlice$ .
goos: windows
goarch: amd64
pkg: main/perf
cpu: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
BenchmarkIndexIntSlice-8         5043324               236.2 ns/op
BenchmarkRangeIntSlice-8         5076255               239.1 ns/op

The genRandomIntSlice() function is used to generate slices of the specified length with elements of type int. As you can see from the final result, there is almost no difference between the subscript and range traversal performance when traversing slices of type []int.

1.5.2 []struct

What about for the slightly more complex []struct type?

type Item struct {
 id  int
 val [1024]byte
}

func BenchmarkIndexStructSlice(b *testing.B) {
 var items [1024]Item
 for i := 0; i < b.N; i++ {
  var tmp int
  for j := 0; j < len(items); j++ {
   tmp = items[j].id
  }
  _ = tmp
 }
}

func BenchmarkRangeIndexStructSlice(b *testing.B) {
 var items [1024]Item
 for i := 0; i < b.N; i++ {
  var tmp int
  for k := range items {
   tmp = items[k].id
  }
  _ = tmp
 }
}

func BenchmarkRangeStructSlice(b *testing.B) {
 var items [1024]Item
 for i := 0; i < b.N; i++ {
  var tmp int
  for _, item := range items {
   tmp = item.id
  }
  _ = tmp
 }
}

The results of the running tests are as follows.

go test -bench=StructSlice$ .
goos: windows
goarch: amd64
pkg: main/perf
cpu: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
BenchmarkIndexStructSlice-8              5079468               234.9 ns/op
BenchmarkRangeIndexStructSlice-8         5087448               236.2 ns/op
BenchmarkRangeStructSlice-8                38716               32265 ns/op

As you can see, there is no difference in performance between the two types of []struct traversal by index, but range has very poor performance when traversing the elements in []struct.

When range only traverses []struct indexes, the performance is much better than when range traverses []struct values. From this, we should be able to see the reason for the big performance difference between the two.

Item is a struct type consisting of two fields, one of type int and one of type [1024]byte. Each iteration over []Item copies the whole value, which brings a performance loss.

In addition, since the range gets a copy of the value, modifications to the copy will not affect the original slice.
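
To make this concrete, here is a minimal sketch (reusing the Item type defined above) showing that writes to the range variable are lost, while writes through the index are not:

items := []Item{{id: 1}, {id: 2}}
for _, item := range items {
 item.id *= 10 // modifies only the copy
}
fmt.Println(items[0].id) // 1: the original slice is unchanged

for i := range items {
 items[i].id *= 10 // modifies the element in place
}
fmt.Println(items[0].id) // 10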

1.5.3 []*struct

What about if the slice contains a pointer to a structure instead of a structure?

// genItems generates a []*Item slice of the specified length
func genItems(n int) []*Item {
 items := make([]*Item, 0, n)
 for i := 0; i < n; i++ {
  items = append(items, &Item{id: i})
 }
 return items
}

func BenchmarkIndexPointer(b *testing.B) {
 items := genItems(1024)
 for i := 0; i < b.N; i++ {
  var tmp int
  for k := 0; k < len(items); k++ {
   tmp = items[k].id
  }
  _ = tmp
 }
}

func BenchmarkRangePointer(b *testing.B) {
 items := genItems(1024)
 for i := 0; i < b.N; i++ {
  var tmp int
  for _, item := range items {
   tmp = item.id
  }
  _ = tmp
 }
}

The results of the execution performance tests are as follows.

go test -bench=Pointer$ main/perf
goos: windows
goarch: amd64
pkg: main/perf
cpu: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
BenchmarkIndexPointer-8           773634              1521 ns/op
BenchmarkRangePointer-8           752077              1514 ns/op

After changing the slice element type from the struct Item to the pointer *Item, the performance of index-based and range-based iteration is almost the same. Using pointers has the added benefit that you can directly modify the struct the pointer refers to.

1.5.4 Summary

During iteration, range yields a copy of each element, while index-based access makes no copy.

If the elements are small, index and range perform almost identically, as with the []int slice of a basic type above. If the elements are large, such as a struct with many fields, index performs significantly better than range, sometimes by a factor of a thousand.

For this scenario, index is recommended. If you do use range, iterate over the index only and access the elements through it, which performs the same as index. If you want range to yield both the index and the value without hurting performance, change the slice/array elements to pointers.

2. Memory Management

2.1 Saving memory with empty structs

2.1.1 Not taking up memory space

In Go, we can use unsafe.Sizeof to calculate the number of bytes an instance of a data type needs to occupy.

package main

import (
 "fmt"
 "unsafe"
)

func main() {
 fmt.Println(unsafe.Sizeof(struct{}{}))
}

The results of the run are as follows.

go run main.go
0

As you can see, the empty struct struct{} in Go does not occupy memory space, unlike the empty struct in C/C++ which still occupies 1 byte.

2.1.2 Usage of Empty Structs

Because empty structs occupy no memory, they are widely used as placeholders in all kinds of scenarios. One reason is to save resources; the other is that an empty struct carries clear semantics by itself, i.e. no value is needed here, it is only a placeholder, so the intent is self-documenting.

Sets

The Go language standard library does not provide an implementation of Set, and maps are usually used instead. In fact, for sets, only the keys of the map are needed, not the values. Even if the value is set to bool, it will take up an extra 1 byte, so assuming there are a million items in the map, that’s 1MB of wasted space.

Therefore, when using map as a set, you can define the value type as an empty structure and use it only as a placeholder.

type Set map[string]struct{}

func (s Set) Has(key string) bool {
 _, ok := s[key]
 return ok
}

func (s Set) Add(key string) {
 s[key] = struct{}{}
}

func (s Set) Delete(key string) {
 delete(s, key)
}

func main() {
 s := make(Set)
 s.Add("foo")
 s.Add("bar")
 fmt.Println(s.Has("foo"))
 fmt.Println(s.Has("bar"))
}

If you want the full functionality of a Set, such as initialization (building a Set from a slice), Add, Del, Clear, Contains and other operations, you can use the open-source library golang-set.

Channels that do not send data
func worker(ch chan struct{}) {
 <-ch
 fmt.Println("do something")
}

func main() {
 ch := make(chan struct{})
 go worker(ch)
 ch <- struct{}{}
 close(ch)
}

Sometimes a channel is used without sending any data, only to notify a sub-goroutine of a task, or only to control the concurrency of a goroutine. In this case, it is very appropriate to use an empty structure as a placeholder.
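
A channel of empty structs is also a common way to limit concurrency. The following is a minimal sketch (our own illustration, assuming sync and fmt are imported) that uses a buffered chan struct{} as a semaphore to allow at most 3 tasks to run at the same time:

sem := make(chan struct{}, 3) // at most 3 concurrent tasks
var wg sync.WaitGroup
for i := 0; i < 10; i++ {
 wg.Add(1)
 sem <- struct{}{} // acquire a slot; blocks when 3 tasks are in flight
 go func(i int) {
  defer wg.Done()
  defer func() { <-sem }() // release the slot
  fmt.Println("task", i)
 }(i)
}
wg.Wait()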

Structs that contain only methods
type Door struct{}

func (d Door) Open() {
 fmt.Println("Open the door")
}

func (d Door) Close() {
 fmt.Println("Close the door")
}

In some scenarios a struct contains only methods and no fields, like Door in the example above. In that case, Door could be declared with virtually any underlying type.

type Door int
type Door bool

Either int or bool will waste extra memory, so in this case, it is best to declare it as an empty structure.

2.2 struct layout with memory alignment in mind

2.2.1 Why memory alignment is needed

The CPU does not access memory byte by byte, but by word size. For example, if a 32-bit CPU has a word size of 4 bytes, then the CPU accesses memory in units of 4 bytes.

The purpose of this design is to reduce the number of memory accesses the CPU makes and increase memory-access throughput. For example, to read 8 bytes of data, reading 4 bytes at a time requires only 2 accesses.

The CPU always accesses memory in word-size units, so without memory alignment the number of memory accesses is likely to increase, as in the following example.

(Figure: memory alignment)

Variables a and b each occupy 3 bytes. With memory alignment, each is padded to occupy 4 bytes, and the CPU needs only one memory access to read the value of variable b. Without memory alignment, the CPU needs 2 memory accesses to read b: the first access fetches the first byte of b, and the second fetches the last two bytes.

As you can also see from this example, memory alignment is also beneficial for achieving atomic operation of variables. Each memory access is atomic, and if the size of the variable does not exceed the word length, then the access to the variable is atomic after memory alignment, a feature that is crucial in concurrency scenarios.

In short: reasonable memory alignment improves the performance of memory reads and writes, and facilitates the atomicity of variable operations.

2.2.2 Go Memory Alignment Rules

Compilers generally align variables in memory in order to reduce CPU access instruction cycles and improve memory access efficiency. Go is a high-performance backend programming language, so it is no exception.

The size and alignment guarantees in the Go Language Specification describe the memory alignment rules.

  1. For a variable x of any type: unsafe.Alignof(x) is at least 1.
  2. For a variable x of struct type: unsafe.Alignof(x) is the largest of all the values unsafe.Alignof(x.f) for each field f of x, but at least 1.
  3. For a variable x of array type: unsafe.Alignof(x) is the same as the alignment of a variable of the array’s element type.

The function unsafe.Alignof is used to get the alignment factor of a variable. The alignment factor determines the offset of the field and the size of the variable, both of which must be integer multiples of the alignment factor.
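
As a quick illustration, here is a small sketch that prints alignment factors with unsafe.Alignof; the values in the comments are what you would typically see on a 64-bit platform and may differ on other architectures.

package main

import (
 "fmt"
 "unsafe"
)

func main() {
 var b bool
 var i32 int32
 var i64 int64
 var s struct {
  a int8
  b int64
 }
 fmt.Println(unsafe.Alignof(b))   // 1
 fmt.Println(unsafe.Alignof(i32)) // 4
 fmt.Println(unsafe.Alignof(i64)) // 8
 fmt.Println(unsafe.Alignof(s))   // 8: the largest alignment among its fields
}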

2.2.3 Reasonable struct layout

Because of memory alignment, a reasonable struct layout can reduce memory usage and improve program performance.

type demo1 struct {
 a int8
 b int16
 c int32
}

type demo2 struct {
 a int8
 c int32
 b int16
}

func main() {
 fmt.Println(unsafe.Sizeof(demo1{})) // 8
 fmt.Println(unsafe.Sizeof(demo2{})) // 12
}

As you can see, the same fields end up with different structure sizes depending on the order in which the fields are aligned.

Each field's offset in memory is determined by its own alignment factor, so the amount of space wasted on padding varies with the field order.

Let's analyze them one by one. First demo1: a is the first field and is aligned by default, occupying 1 byte starting at offset 0. b is the second field with an alignment factor of 2, so 1 byte of padding is inserted to make its offset a multiple of 2, and it occupies 2 bytes starting at offset 2. c is the third field with an alignment factor of 4; the memory is already aligned at this point, so it occupies 4 bytes starting at offset 4.

So demo1’s memory footprint is 8 bytes.

For demo2: a is the first field, which is aligned by default and occupies 1 byte starting at position 0. c is the second field, which is aligned by a factor of 4. Therefore, 3 bytes must be left blank for the offset to be a multiple of 4 and occupy 4 bytes starting at position 4. b is the third field, which is aligned by a factor of 2 and occupies 2 bytes starting at position 8.

The alignment factor of demo2 is determined by the alignment factor of c, which is also 4. Therefore, demo2 occupies 12 bytes of memory.

(Figure: struct layout)

Therefore, in the design of structures that are particularly memory-sensitive, we can reduce the memory footprint by adjusting the order of the fields in the structure, arranging the field widths from smallest to largest from top to bottom.

2.2.4 Effects of Empty Structs and Arrays on Memory Alignment

Empty structures and arrays are special in Go. An empty struct{} without any fields and an array without any elements occupies a memory space of size 0.

Because of this, when an empty struct{} or an empty array is a field of another struct, it generally needs no padding. There is one exception: when struct{} or an empty array is the last field of the struct, padding is required. The reason is that if a pointer were taken to this field, the returned address would point past the end of the struct; if that pointer stayed alive, the struct's memory could not be freed, causing a memory leak.

type demo3 struct {
 a struct{}
 b int32
}
type demo4 struct {
 b int32
 a struct{}
}

func main() {
 fmt.Println(unsafe.Sizeof(demo3{})) // 4
 fmt.Println(unsafe.Sizeof(demo4{})) // 8
}

As you can see, demo3{} is 4 bytes in size, which is the same as the space occupied by field b, while demo4{} is 8 bytes in size, which means an additional 4 bytes of space is filled.

2.3 Reducing escapes and limiting variables to the stack

Variable escapes generally occur in the following cases.

  • Variables are large
  • Variables of uncertain size
  • Uncertain variable type
  • Returning a pointer
  • Returning a reference
  • Closures

Once we know what causes variables to escape, we can deliberately keep variables on the stack, reduce heap allocations, lower the GC cost, and improve program performance.

2.3.1 Smaller copies are better than references

A small copy is better than a reference: it means we should prefer stack variables over heap variables. Here is a counterintuitive example demonstrating that a small copy on the stack is better than creating a reference variable on the heap.

We all know that arrays are passed by value in Go, and since their length cannot grow, we generally use them sparingly for performance reasons. In reality there are no absolutes: sometimes copying an array performs better than using a slice.

// copy/copy.go

const capacity = 1024

func arrayFibonacci() [capacity]int {
 var d [capacity]int
 for i := 0; i < len(d); i++ {
  if i <= 1 {
   d[i] = 1
   continue
  }
  d[i] = d[i-1] + d[i-2]
 }
 return d
}

func sliceFibonacci() []int {
 d := make([]int, capacity)
 for i := 0; i < len(d); i++ {
  if i <= 1 {
   d[i] = 1
   continue
  }
  d[i] = d[i-1] + d[i-2]
 }
 return d
}

See below for a performance comparison.

func BenchmarkArray(b *testing.B) {
 for i := 0; i < b.N; i++ {
  _ = arrayFibonacci()
 }
}

func BenchmarkSlice(b *testing.B) {
 for i := 0; i < b.N; i++ {
  _ = sliceFibonacci()
 }
}

Running the benchmark test above will give you the following results.

go test -bench=. -benchmem -gcflags="-l" main/copy
goos: darwin
goarch: amd64
pkg: main/copy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkArray-12         692400              1708 ns/op               0 B/op          0 allocs/op
BenchmarkSlice-12         464974              2242 ns/op            8192 B/op          1 allocs/op
PASS
ok      main/copy       3.908s

As you can see from the test results, copying the array is actually faster than using a slice. Why is that?

The slice allocated as a local variable in sliceFibonacci() has to be returned outside the function, so it escapes and must request memory on the heap. The test also makes it clear that arrayFibonacci() performs no heap allocation and creates the array entirely on the stack. This shows that for some short-lived objects, the cost of copying on the stack is much smaller than allocating and reclaiming on the heap.

Note that the benchmark above is run with the compile option "-l", which disables inlining. If inlining occurred, the variable would not escape, there would be no heap allocation and reclamation, and no performance difference would be visible between the two.

The compiler’s optimization decisions for the above two functions can be viewed at compile time with the help of the option -gcflags=-m.

go build  -gcflags=-m copy/copy.go
# command-line-arguments
copy/copy.go:5:6: can inline arrayFibonacci
copy/copy.go:17:6: can inline sliceFibonacci
copy/copy.go:18:11: make([]int, capacity) escapes to heap

You can see that both arrayFibonacci() and sliceFibonacci() can be inlined, and that the local slice defined in sliceFibonacci() escapes to the heap.

So how small does a variable have to be? For the Go compiler, local variables exceeding a certain size escape to the heap, and the threshold may differ between Go versions. In general, local variables smaller than 64 KB do not escape to the heap.

2.3.2 Returning a value vs. returning a pointer

Passing a value copies the entire object, while passing a pointer copies only the address, which points at the same object. Returning a pointer reduces value copying, but the allocation escapes to the heap, increasing the burden on the garbage collector (GC). In scenarios where objects are created and destroyed frequently, the GC overhead caused by passing pointers can seriously hurt performance.

In general, for structures that need to modify the original object value, or have a relatively large memory footprint, choose to return a pointer. For read-only structures with a smaller memory footprint, returning the value directly can yield better performance.
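
For example, here is a minimal sketch (Point is a hypothetical type, not from the article): the value-returning version copies a small struct and can stay on the stack, while the pointer-returning version lets the caller share and modify the same object but usually forces it onto the heap.

type Point struct {
 X, Y int
}

// Returning the value copies 16 bytes; the result can live on the caller's stack.
func NewPointValue(x, y int) Point {
 return Point{X: x, Y: y}
}

// Returning a pointer avoids the copy and lets callers modify the original,
// but the Point generally escapes to the heap.
func NewPointPointer(x, y int) *Point {
 return &Point{X: x, Y: y}
}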

2.3.3 Return value using a deterministic type

If the type of a value cannot be determined at compile time, it escapes to the heap. Therefore, if the return value of a function can have a concrete type, do not return interface{}.

Let’s use the Fibonacci series function above as an example to see the performance difference between a deterministic return value and interface{}.

const capacity = 1024

func arrayFibonacci() [capacity]int {
 var d [capacity]int
 for i := 0; i < len(d); i++ {
  if i <= 1 {
   d[i] = 1
   continue
  }
  d[i] = d[i-1] + d[i-2]
 }
 return d
}

func arrayFibonacciIfc() interface{} {
 var d [capacity]int
 for i := 0; i < len(d); i++ {
  if i <= 1 {
   d[i] = 1
   continue
  }
  d[i] = d[i-1] + d[i-2]
 }
 return d
}
func BenchmarkArray(b *testing.B) {
 for i := 0; i < b.N; i++ {
  _ = arrayFibonacci()
 }
}

func BenchmarkIfc(b *testing.B) {
 for i := 0; i < b.N; i++ {
  _ = arrayFibonacciIfc()
 }
}

The results of running the benchmark test above are as follows.

go test -bench=. -benchmem main/copy
goos: darwin
goarch: amd64
pkg: main/copy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkArray-12         832418              1427 ns/op               0 B/op          0 allocs/op
BenchmarkIfc-12           380626              2861 ns/op            8192 B/op          1 allocs/op
PASS
ok      main/copy       3.742s

As you can see, when a function returns its result as interface{}, the compiler cannot determine the concrete type of the return value, so the value escapes to the heap. With allocation and reclamation happening on the heap, performance is noticeably worse.

2.4 sync.Pool reuse object

2.4.1 Introduction

sync.Pool is a component of the sync package. It acts as a "pool" that holds temporarily unused objects for reuse. Personally, I think the name is a bit misleading, because the objects held in the pool can be reclaimed without notice; sync.Cache might have been a more appropriate name.

The sync.Pool is scalable and concurrency-safe, and its capacity is limited only by the size of its memory. Objects stored in the pool are automatically cleaned up if they become inactive.

2.4.2 Role

sync.Pool is a good choice wherever memory must be allocated and reclaimed repeatedly. Frequent allocation and reclamation put a burden on the GC and, in severe cases, cause CPU spikes. sync.Pool can cache objects that are temporarily unused and hand them back the next time they are needed, without going through allocation again; reusing the objects' memory reduces GC pressure and improves system performance.

In a nutshell: it is used to save and reuse temporary objects, reduce memory allocation, and reduce GC pressure.

2.4.3 Usage

sync.Pool is very simple to use: you only need to provide a New function. When there is no available object in the pool, New is called to create one.

Suppose we have a “student” structure and reuse the structure object.

type Student struct {
 Name   string
 Age    int32
 Remark [1024]byte
}

var studentPool = sync.Pool{
    New: func() interface{} { 
        return new(Student) 
    },
}

Then call the Pool’s Get() and Put() methods to get and put back into the pool.

stu := studentPool.Get().(*Student)
json.Unmarshal(buf, stu)
studentPool.Put(stu)

Get() retrieves an object from the pool; because the return value is interface{}, a type assertion is needed.

Put() is used to put the object back into the object pool when it is finished being used.
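
One usage note (a common pattern, not something stated above): it is usually wise to reset the object before returning it, so the next Get() does not see stale field values. The bytes.Buffer example in the next subsection calls Reset() for the same reason. A minimal sketch with the Student pool above:

stu := studentPool.Get().(*Student)
// ... use stu ...
*stu = Student{} // clear the fields so the next user starts from a clean object
studentPool.Put(stu)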

2.4.4 Performance Differences

Let’s take bytes.Buffer as an example and use sync.Pool to reuse bytes.Buffer objects to avoid duplicate memory creation and recycling to see the performance improvement.

var bufferPool = sync.Pool{
 New: func() interface{} {
  return &bytes.Buffer{}
 },
}

var data = make([]byte, 10000)

func BenchmarkBufferWithPool(b *testing.B) {
 for n := 0; n < b.N; n++ {
  buf := bufferPool.Get().(*bytes.Buffer)
  buf.Write(data)
  buf.Reset()
  bufferPool.Put(buf)
 }
}

func BenchmarkBuffer(b *testing.B) {
 for n := 0; n < b.N; n++ {
  var buf bytes.Buffer
  buf.Write(data)
 }
}

The test results are as follows.

go test -bench=. -benchmem main/pool
goos: darwin
goarch: amd64
pkg: main/pool
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkBufferWithPool-12      11987966                97.12 ns/op            0 B/op          0 allocs/op
BenchmarkBuffer-12               1246887              1020 ns/op           10240 B/op          1 allocs/op
PASS
ok      main/pool       3.510s

This example builds a pool of bytes.Buffer objects; each iteration performs only a Write, i.e. one data copy, whose cost is almost negligible. The time spent on memory allocation and reclamation is therefore the dominant factor in the overall result. The test results also show that when objects are reused through the Pool, each operation no longer allocates memory.

2.4.5 Applications in the standard library

The Go standard library also makes extensive use of sync.Pool, for example in fmt and encoding/json. Let's take the fmt package as an example to see how it uses sync.Pool.

We can take a look at the most commonly used standard formatting output function, the Printf() function.

// Printf formats according to a format specifier and writes to standard output.
// It returns the number of bytes written and any write error encountered.
func Printf(format string, a ...interface{}) (n int, err error) {
 return Fprintf(os.Stdout, format, a...)
}

Continue with the definition of Fprintf().

// Fprintf formats according to a format specifier and writes to w.
// It returns the number of bytes written and any write error encountered.
func Fprintf(w io.Writer, format string, a ...interface{}) (n int, err error) {
 p := newPrinter()
 p.doPrintf(format, a)
 n, err = w.Write(p.buf)
 p.free()
 return
}

The argument to the Fprintf() function is an io.Writer, and the argument passed to Fprintf() by Printf() is os.Stdout, which is equivalent to direct output to standard output. The newPrinter here uses the sync.Pool.

// go version go1.17 darwin/amd64

// pp is used to store a printer's state and is reused with sync.Pool to avoid allocations.
type pp struct {
    buf buffer
    ...
}

var ppFree = sync.Pool{
 New: func() interface{} { return new(pp) },
}

// newPrinter allocates a new pp struct or grabs a cached one.
func newPrinter() *pp {
 p := ppFree.Get().(*pp)
 p.panicking = false
 p.erroring = false
 p.wrapErrs = false
 p.fmt.init(&p.buf)
 return p
}

// free saves used pp structs in ppFree; avoids an allocation per invocation.
func (p *pp) free() {
 // Proper usage of a sync.Pool requires each entry to have approximately
 // the same memory cost. To obtain this property when the stored type
 // contains a variably-sized buffer, we add a hard limit on the maximum buffer
 // to place back in the pool.
 //
 // See https://golang.org/issue/23199
 if cap(p.buf) > 64<<10 {
  return
 }

 p.buf = p.buf[:0]
 p.arg = nil
 p.value = reflect.Value{}
 p.wrappedErr = nil
 ppFree.Put(p)
}

The calls to fmt.Printf() are very frequent, and using sync.Pool to reuse pp objects can greatly improve performance, reduce memory usage, and reduce GC pressure.

3. Concurrent Programming

3.1 About locks

3.1.1 Lock-free

Locks exist to avoid safety problems caused by simultaneous access to shared resources in a concurrent environment. So is locking always necessary in concurrent code? The answer is no: not all concurrency requires locks. Appropriately reducing lock granularity, or even adopting a lock-free design, is more likely to improve concurrency.

There are two main approaches to going lock-free: lock-free data structures and serial lock-free processing.

Lock-free data structure

Lock-free data structures can be implemented using hardware-supported atomic operations, which are lock-free, concurrency-safe, and whose performance scales linearly with the number of CPUs. Many languages provide CAS atomic operations (e.g., the atomic package in Go and the atomic library in C++11) that can be used to implement lock-free data structures, such as lock-free linked lists.
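
Before looking at the linked list, here is a minimal CAS sketch (our own illustration, assuming sync/atomic is imported): incrementing a counter without a mutex by retrying CompareAndSwap until it succeeds.

var counter int64

func incr() {
 for {
  old := atomic.LoadInt64(&counter)
  if atomic.CompareAndSwapInt64(&counter, old, old+1) {
   return
  }
 }
}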

Let’s take a simple, thread-safe insert operation on a singly linked list to see the difference between lock-free programming and ordinary locking.

package list

import (
 "fmt"
 "sync"
 "sync/atomic"

 "golang.org/x/sync/errgroup"
)

// Node is a node of the linked list
type Node struct {
 Value interface{}
 Next  *Node
}

//
// A simple implementation of a singly linked list with a lock
//

// WithLockList is a singly linked list guarded by a mutex
type WithLockList struct {
 Head *Node
 mu   sync.Mutex
}

// Push inserts an element at the head of the list
func (l *WithLockList) Push(v interface{}) {
 l.mu.Lock()
 defer l.mu.Unlock()
 n := &Node{
  Value: v,
  Next:  l.Head,
 }
 l.Head = n
}

// String returns the locked list as a string
func (l WithLockList) String() string {
 s := ""
 cur := l.Head
 for {
  if cur == nil {
   break
  }
  if s != "" {
   s += ","
  }
  s += fmt.Sprintf("%v", cur.Value)
  cur = cur.Next
 }
 return s
}

//
// A simple implementation of a lock-free singly linked list
//

// LockFreeList is a lock-free singly linked list
type LockFreeList struct {
 Head atomic.Value
}

// Push inserts an element at the head of the list without locking
func (l *LockFreeList) Push(v interface{}) {
 for {
  head := l.Head.Load()
  headNode, _ := head.(*Node)
  n := &Node{
   Value: v,
   Next:  headNode,
  }
  if l.Head.CompareAndSwap(head, n) {
   break
  }
 }
}

// String returns the lock-free list as a string
func (l LockFreeList) String() string {
 s := ""
 cur := l.Head.Load().(*Node)
 for {
  if cur == nil {
   break
  }
  if s != "" {
   s += ","
  }
  s += fmt.Sprintf("%v", cur.Value)
  cur = cur.Next
 }
 return s
}

There are a few points to note about the lock-free singly linked list implementation above.

  1. Inserting into the lock-free list requires a CAS operation, i.e. calling the CompareAndSwap() method; if the insertion fails, the for loop retries until it succeeds.
  2. To make it easy to print the list contents, a String() method is implemented that traverses the list.

Let’s do a concurrent write operation on each of the two linked lists to verify their functionality.

package main

import (
 "fmt"

 "golang.org/x/sync/errgroup"

 "main/list"
)

// ConcurWriteWithLockList writes to the locked list concurrently
func ConcurWriteWithLockList(l *list.WithLockList) {
 var g errgroup.Group
 // 10 goroutines push to the list concurrently
 for i := 0; i < 10; i++ {
  i := i
  g.Go(func() error {
   l.Push(i)
   return nil
  })
 }
 _ = g.Wait()
}

// ConcurWriteLockFreeList writes to the lock-free list concurrently
func ConcurWriteLockFreeList(l *list.LockFreeList) {
 var g errgroup.Group
 // 10 goroutines push to the list concurrently
 for i := 0; i < 10; i++ {
  i := i
  g.Go(func() error {
   l.Push(i)
   return nil
  })
 }
 _ = g.Wait()
}

func main() {
 // Concurrently write to the locked list, then print it
 l1 := &list.WithLockList{}
 ConcurWriteWithLockList(l1)
 fmt.Println(l1)

 // Concurrently write to the lock-free list, then print it
 l2 := &list.LockFreeList{}
 ConcurWriteLockFreeList(l2)
 fmt.Println(l2)
}

Note that the results of running the main() function above multiple times may not be the same, because concurrency is out of order.

8,7,6,9,5,4,3,1,2,0
9,8,7,6,5,4,3,2,0,1

Here’s another look at benchmarking the linked-list Push operation, to compare the performance of the locked and lock-free versions.

func BenchmarkWriteWithLockList(b *testing.B) {
 l := &WithLockList{}
 for n := 0; n < b.N; n++ {
  l.Push(n)
 }
}
BenchmarkWriteWithLockList-8    14234166                83.58 ns/op

func BenchmarkWriteLockFreeList(b *testing.B) {
 l := &LockFreeList{}
 for n := 0; n < b.N; n++ {
  l.Push(n)
 }
}
BenchmarkWriteLockFreeList-8    15219405                73.15 ns/op

You can see that the lock-free version performs better than the locked version.

Serial lock-free

Serial lock-free is an approach that avoids locks by avoiding concurrent access to shared resources: each concurrent operation accesses its own exclusive resource, achieving the effect of serial access. The implementation differs by scenario; for example, in network I/O a single-Reactor multi-threaded model can be replaced with a master-slave Reactor multi-threaded model to avoid lock contention on reads from a single message queue.

Here is a situation often encountered in backend microservice development. We frequently need to pull information concurrently from multiple sources and aggregate it into a single variable, which means mutually exclusive writes to that variable, for example writes to a map. Instead, each goroutine can write its result into its own temporary object, decoupling the concurrent goroutines from the shared variable, and the results can then be merged without a lock. In other words: process independently, then merge.

(Figure: serial lock-free)

To simulate the above situation, simply write a sample program to compare the performance.

import (
 "sync"

 "golang.org/x/sync/errgroup"
)

// ConcurWriteMapWithLock writes to a map concurrently with a lock
func ConcurWriteMapWithLock() map[int]int {
 m := make(map[int]int)
 var mu sync.Mutex
 var g errgroup.Group
 // 10 goroutines write to the map concurrently
 for i := 0; i < 10; i++ {
  i := i
  g.Go(func() error {
   mu.Lock()
   defer mu.Unlock()
   m[i] = i * i
   return nil
  })
 }
 _ = g.Wait()
 return m
}

// ConcurWriteMapLockFree writes to a map concurrently without a lock
func ConcurWriteMapLockFree() map[int]int {
 m := make(map[int]int)
 // each goroutine owns its own element of values
 values := make([]int, 10)
 // 10 goroutines write their results concurrently
 var g errgroup.Group
 for i := 0; i < 10; i++ {
  i := i
  g.Go(func() error {
   values[i] = i * i
   return nil
  })
 }
 _ = g.Wait()
 // merge the results into the map
 for i, v := range values {
  m[i] = v
 }
 return m
}

Look at the performance differences between the two.

func BenchmarkConcurWriteMapWithLock(b *testing.B) {
 for n := 0; n < b.N; n++ {
  _ = ConcurWriteMapWithLock()
 }
}
BenchmarkConcurWriteMapWithLock-8         218673              5089 ns/op

func BenchmarkConcurWriteMapLockFree(b *testing.B) {
 for n := 0; n < b.N; n++ {
  _ = ConcurWriteMapLockFree()
 }
}
BenchmarkConcurWriteMapLockFree-8         316635              4048 ns/op

3.1.2 Reducing lock contention

If locking is unavoidable, you can shard the resource to reduce contention on each lock, which also improves overall performance.

For example, Go's excellent local cache components bigcache, go-cache, and freecache all implement sharding, with one lock per shard, to reduce lock contention and improve overall performance.

As a simple example, let's compare concurrent writes to a map[int]struct{} with and without sharding, to see the performance improvement from reducing lock contention.

var (
 num = 1000000
 m0  = make(map[int]struct{}, num)
 mu0 = sync.RWMutex{}
 m1  = make(map[int]struct{}, num)
 mu1 = sync.RWMutex{}
)

// ConWriteMapNoShard writes to a single map without sharding.
func ConWriteMapNoShard() {
 g := errgroup.Group{}
 for i := 0; i < num; i++ {
  i := i
  g.Go(func() error {
   mu0.Lock()
   defer mu0.Unlock()
   m0[i] = struct{}{}
   return nil
  })
 }
 _ = g.Wait()
}

// ConWriteMapTwoShard writes to two maps, sharded by key parity.
func ConWriteMapTwoShard() {
 g := errgroup.Group{}
 for i := 0; i < num; i++ {
  i := i
  g.Go(func() error {
   if i&1 == 0 {
    mu0.Lock()
    defer mu0.Unlock()
    m0[i] = struct{}{}
    return nil
   }
   mu1.Lock()
   defer mu1.Unlock()
   m1[i] = struct{}{}
   return nil
  })
 }
 _ = g.Wait()
}

Look at the performance differences between the two.

func BenchmarkConWriteMapNoShard(b *testing.B) {
 for i := 0; i < b.N; i++ {
  ConWriteMapNoShard()
 }
}
BenchmarkConWriteMapNoShard-12                 3         472063245 ns/op

func BenchmarkConWriteMapTwoShard(b *testing.B) {
 for i := 0; i < b.N; i++ {
  ConWriteMapTwoShard()
 }
}
BenchmarkConWriteMapTwoShard-12                4         310588155 ns/op

As you can see, sharding the shared resource reduces lock contention and significantly improves the program's concurrent performance. Predictably, the performance gap grows as the shard granularity becomes finer. Of course, it is not the case that the finer the granularity, the better: each shard needs its own lock, which brings extra, unnecessary overhead. You can choose a moderate number of shards to strike a balance between performance and cost.

3.1.3 Prefer shared locks to mutually exclusive locks

If concurrency is not lock-free, give preference to shared locks over mutually exclusive locks.

A mutually exclusive lock is a lock that can only be acquired by one Goroutine. Shared locks are locks that can be acquired by multiple Goroutines at the same time.

The Go standard library sync provides two types of locks, mutex locks (sync.Mutex) and read-write locks (sync.RWMutex), and read-write locks are a concrete implementation of shared locks.

sync.Mutex

Mutex locks are used to ensure that a shared resource can only be occupied by one Goroutine at a time, and that if one Goroutine is occupied, the other Goroutines block and wait.

(Figure: sync.Mutex)

sync.Mutex provides two exported methods for using the lock.

Lock()   // acquire the lock
Unlock()   // release the lock

We can call the Lock method before accessing a shared resource and call the Unlock method after we are done, or use a defer statement to make sure the mutex is always released. Once a goroutine has acquired the lock with Lock, all other goroutines requesting the lock block in Lock until it is released.
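
A minimal sketch of this pattern (the names balance and Deposit are ours, not from the article):

var (
 mu      sync.Mutex
 balance int
)

func Deposit(amount int) {
 mu.Lock()
 defer mu.Unlock() // the lock is released even if the code below panics
 balance += amount
}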

sync.RWMutex

A read-write lock is a shared lock, also known as a multiple readers, single writer lock. When acquiring the lock, a distinction is made between the operation being protected: one mode for reading and one for writing. Since multiple goroutines may hold the read lock at the same time, it is a shared lock; the write lock, however, is mutually exclusive.

In general, there are several cases as follows.

  • Read locks are not mutually exclusive with each other: as long as no write lock is held, multiple goroutines can hold the read lock at the same time.
  • Write locks are mutually exclusive with each other: a held write lock blocks other write locks.
  • Write locks and read locks are mutually exclusive: a held read lock blocks a write lock, and a held write lock blocks read locks.

(Figure: sync.RWMutex)

sync.RWMutex provides five exported methods for using the lock.

Lock()    // acquire the write lock
Unlock()   // release the write lock
RLock()    // acquire the read lock
RUnlock()   // release the read lock
RLocker() Locker // returns a Locker whose Lock() and Unlock() methods call RLock() and RUnlock()

Read-write locks exist to solve the performance problem of read-heavy, write-light workloads; in read-heavy scenarios they effectively reduce the time spent blocked on locks.

Performance comparison.

Most business scenarios are read-heavy and write-light, so using a read-write lock can effectively improve the efficiency of access to shared data. In the worst case, with only write requests, a read-write lock simply degrades into a mutex. So preferring read-write locks over mutex locks can improve the concurrent performance of a program.

Next, we test the performance difference between mutually exclusive locks and read-write locks in three scenarios.

  • Read more than write (80% of read)
  • Read and write together (50% each)
  • Read less write more (read 20%)

First, concurrent reads and writes of a shared map are implemented with a mutex and with a read-write lock, respectively.

// OpMapWithMutex reads and writes a map using a mutex lock.
// rpct is the percentage of read operations.
func OpMapWithMutex(rpct int) {
 m := make(map[int]struct{})
 mu := sync.Mutex{}
 var wg sync.WaitGroup
 for i := 0; i < 100; i++ {
  i := i
  wg.Add(1)
  go func() {
   defer wg.Done()
   mu.Lock()
   defer mu.Unlock()
    // Write operation.
   if i >= rpct {
    m[i] = struct{}{}
    time.Sleep(time.Microsecond)
    return
   }
    // Read operation.
   _ = m[i]
   time.Sleep(time.Microsecond)
  }()
 }
 wg.Wait()
}

// OpMapWithRWMutex reads and writes a map using a read-write lock.
// rpct is the percentage of read operations.
func OpMapWithRWMutex(rpct int) {
 m := make(map[int]struct{})
 mu := sync.RWMutex{}
 var wg sync.WaitGroup
 for i := 0; i < 100; i++ {
  i := i
  wg.Add(1)
  go func() {
   defer wg.Done()
    // Write operation.
   if i >= rpct {
    mu.Lock()
    defer mu.Unlock()
    m[i] = struct{}{}
    time.Sleep(time.Microsecond)
    return
   }
    // Read operation.
   mu.RLock()
   defer mu.RUnlock()
   _ = m[i]
   time.Sleep(time.Microsecond)
  }()
 }
 wg.Wait()
}

The parameter rpct adjusts the percentage of read operations so we can simulate different read/write mixes: rpct = 80 means read-heavy (80% reads), rpct = 50 means equal reads and writes, and rpct = 20 means write-heavy (20% reads).

func BenchmarkMutexReadMore(b *testing.B) {
 for i := 0; i < b.N; i++ {
  OpMapWithMutex(80)
 }
}

func BenchmarkRWMutexReadMore(b *testing.B) {
 for i := 0; i < b.N; i++ {
  OpMapWithRWMutex(80)
 }
}

func BenchmarkMutexRWEqual(b *testing.B) {
 for i := 0; i < b.N; i++ {
  OpMapWithMutex(50)
 }
}

func BenchmarkRWMutexRWEqual(b *testing.B) {
 for i := 0; i < b.N; i++ {
  OpMapWithRWMutex(50)
 }
}

func BenchmarkMutexWriteMore(b *testing.B) {
 for i := 0; i < b.N; i++ {
  OpMapWithMutex(20)
 }
}

func BenchmarkRWMutexWriteMore(b *testing.B) {
 for i := 0; i < b.N; i++ {
  OpMapWithRWMutex(20)
 }
}

Execute all benchmark tests under the current package with the following results.

dablelv@DABLELV-MB0 mutex % go test -bench=.
goos: darwin
goarch: amd64
pkg: main/mutex
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkMutexReadMore-12                   2462            485917 ns/op
BenchmarkRWMutexReadMore-12                 8074            145690 ns/op
BenchmarkMutexRWEqual-12                    2406            498673 ns/op
BenchmarkRWMutexRWEqual-12                  4124            303693 ns/op
BenchmarkMutexWriteMore-12                  1906            532350 ns/op
BenchmarkRWMutexWriteMore-12                2462            432386 ns/op
PASS
ok      main/mutex      9.532s

As the results show, read-write locks give noticeably better concurrency performance in read-heavy scenarios. It is reasonable to expect that the lower the write ratio, the bigger the advantage of the read-write lock.

Note that because each read or write of the map takes very little time, each operation sleeps for one microsecond (one millionth of a second) to amplify the cost. Otherwise the time spent accessing the shared resource would be smaller than the cost of the lock handling itself, and the optimization from read-write locks would become much less obvious or could even degrade performance.

3.2 Limiting the number of Goroutines

3.2.1 Problems with too many Goroutines

Program crashes.

Goroutine is a lightweight thread managed by the Go runtime. It allows us to easily implement concurrent programming. But when we open an unlimited number of Goroutines, we run into a fatal problem.

func main() {
 var wg sync.WaitGroup
 for i := 0; i < math.MaxInt32; i++ {
  wg.Add(1)
  go func(i int) {
   defer wg.Done()
   fmt.Println(i)
   time.Sleep(time.Second)
  }(i)
 }
 wg.Wait()
}

This example launches math.MaxInt32 Goroutines concurrently (2^31 - 1, roughly 2 billion), each of which only prints a number and sleeps for a second. Under normal circumstances, the program would print the numbers 0 to 2^31 - 1 in random order.

Will the program run as smoothly as expected?

go run main.go
...
108668
1142025
panic: too many concurrent operations on a single file or socket (max 1048575)

goroutine 1158408 [running]:
internal/poll.(*fdMutex).rwlock(0xc0000ae060, 0x0)
        /usr/local/go/src/internal/poll/fd_mutex.go:147 +0x11b
internal/poll.(*FD).writeLock(...)
        /usr/local/go/src/internal/poll/fd_mutex.go:239
internal/poll.(*FD).Write(0xc0000ae060, {0xc12cadf690, 0x8, 0x8})
        /usr/local/go/src/internal/poll/fd_unix.go:262 +0x72
os.(*File).write(...)
        /usr/local/go/src/os/file_posix.go:49
os.(*File).Write(0xc0000ac008, {0xc12cadf690, 0x1, 0xc12ea62f50})
        /usr/local/go/src/os/file.go:176 +0x65
fmt.Fprintln({0x10c00e0, 0xc0000ac008}, {0xc12ea62f90, 0x1, 0x1})
        /usr/local/go/src/fmt/print.go:265 +0x75
fmt.Println(...)
        /usr/local/go/src/fmt/print.go:274
main.main.func1(0x0)
        /Users/dablelv/work/code/test/main.go:16 +0x8f
...

The result of the run is that the program crashes straight away, with the following key error message.

panic: too many concurrent operations on a single file or socket (max 1048575)

The number of concurrent operations on a single file/socket exceeds the system limit. This error comes from fmt.Println, which writes the formatted string to standard output. In Linux, standard output is also treated as a file; the kernel accesses files through file descriptors, where descriptor 0 is standard input, 1 is standard output and 2 is standard error.

In short, the system is running out of resources.

So what if we remove the fmt.Println line of code? Then the program will likely crash from running out of memory. This is easy to understand: each Goroutine needs at least about 2KB of space, so on a machine with 4GB of memory at most about 4GB/2KB = 2M Goroutines can exist at the same time. If the Goroutines also allocate memory for other work, the number that can run concurrently drops by another order of magnitude.

Cost of Goroutine.

The previous example is extreme; normal programs do not launch Goroutines without bound. Its purpose is simply to show that the number of Goroutines has a limit and that they cannot be created indefinitely.

What if we open a large number of Goroutines without actually crashing the program? If we really want to do that, we should be clear that although Goroutines are lightweight, they still have overhead.

A Goroutine's overhead falls mainly into three areas: creation (memory usage), scheduling (extra load on the scheduler), and destruction (extra pressure on the GC).

  1. Memory overhead

    In terms of space, a Goroutine occupies about 2KB of memory. In the runtime source src/runtime/runtime2.go, we can find the Goroutine's structure definition, type g struct.
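
    As a rough, illustrative sketch (not from the original; the count and the StackInuse field choice are assumptions), the per-Goroutine stack footprint can be estimated by comparing runtime.MemStats before and after starting a batch of idle Goroutines:

    package main

    import (
     "fmt"
     "runtime"
     "sync"
    )

    func main() {
     const num = 10000
     var before, after runtime.MemStats
     runtime.GC()
     runtime.ReadMemStats(&before)

     var wg sync.WaitGroup
     wg.Add(num)
     done := make(chan struct{})
     for i := 0; i < num; i++ {
      go func() {
       wg.Done()
       <-done // keep the Goroutine alive while measuring
      }()
     }
     wg.Wait()

     runtime.GC()
     runtime.ReadMemStats(&after)
     fmt.Printf("~%.2f KB per goroutine\n", float64(after.StackInuse-before.StackInuse)/num/1024)
     close(done)
    }

    On typical Go versions this should report a figure on the order of the 2KB estimate above.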

  2. Scheduling overhead

    In terms of time, Goroutine scheduling also has CPU overhead. We can call runtime.Gosched() to make the current Goroutine voluntarily yield the CPU so that another Goroutine can run; the following code measures the time taken by such a switch.

    const NUM = 10000

    func cal() {
     for i := 0; i < NUM; i++ {
      runtime.Gosched()
     }
    }

    func main() {
     // Set only one Processor
     runtime.GOMAXPROCS(1)
     start := time.Now().UnixNano()
     go cal()
     for i := 0; i < NUM; i++ {
      runtime.Gosched()
     }
     end := time.Now().UnixNano()
     fmt.Printf("total %vns per %vns", end-start, (end-start)/NUM)
    }
    

    The output is as follows.

    total 997200ns per 99ns
    

    As the output shows, a Goroutine switch takes roughly 100ns, which is excellent compared with thread switches that cost on the order of microseconds, but it is still an overhead.

  3. GC Overhead

    The memory resources occupied by a Goroutine from its creation to the end of its operation need to be reclaimed by the GC, and if a large number of Goroutines are created endlessly, it will inevitably put pressure on the GC.

    package main

    import (
     "fmt"
     "runtime"
     "runtime/debug"
     "sync"
     "time"
    )

    func createLargeNumGoroutine(num int, wg *sync.WaitGroup) {
     wg.Add(num)
     for i := 0; i < num; i++ {
      go func() {
       defer wg.Done()
      }()
     }
    }

    func main() {
     // Set only one Processor so the Goroutines run serially
     runtime.GOMAXPROCS(1)
     // Disable automatic GC and trigger it manually instead
     debug.SetGCPercent(-1)

     var wg sync.WaitGroup
     createLargeNumGoroutine(1000, &wg)
     wg.Wait()
     t := time.Now()
     runtime.GC() // manual GC
     cost := time.Since(t)
     fmt.Printf("GC cost %v when goroutine num is %v\n", cost, 1000)

     createLargeNumGoroutine(10000, &wg)
     wg.Wait()
     t = time.Now()
     runtime.GC() // manual GC
     cost = time.Since(t)
     fmt.Printf("GC cost %v when goroutine num is %v\n", cost, 10000)

     createLargeNumGoroutine(100000, &wg)
     wg.Wait()
     t = time.Now()
     runtime.GC() // manual GC
     cost = time.Since(t)
     fmt.Printf("GC cost %v when goroutine num is %v\n", cost, 100000)
    }
    

    The output is as follows.

    GC cost 0s when goroutine num is 1000
    GC cost 2.0027ms when goroutine num is 10000
    GC cost 30.9523ms when goroutine num is 100000
    

    As the number of Goroutines created increases, the GC time consumption increases.

    The purpose of the analysis above is to quantify the overhead of Goroutines as far as possible. Although it is officially claimed that creating thousands of Goroutines in a Go program is not a problem, the small per-Goroutine overhead is magnified when we create 100,000, 1,000,000 or even 10,000,000 of them.

3.2.2 Limiting the number of Goroutines

To protect the program and improve performance, we should actively limit the number of concurrent Goroutines.

This can be done by using the buffer size of the channel.

func main() {
 var wg sync.WaitGroup
 ch := make(chan struct{}, 3)
 for i := 0; i < 10; i++ {
  ch <- struct{}{}
  wg.Add(1)
  go func(i int) {
   defer wg.Done()
   log.Println(i)
   time.Sleep(time.Second)
   <-ch
  }(i)
 }
 wg.Wait()
}

The example above creates a channel with a buffer size of 3. If no value has been received, at most 3 values can be sent before sending blocks. Before starting each Goroutine, ch <- struct{}{} is called, which blocks when the buffer is full; <-ch is called when the Goroutine's task finishes, freeing a slot in the buffer.

sync.WaitGroup is not always required. For an HTTP service, for example, each request is naturally handled concurrently, so when a channel is used to control the number of concurrently processed tasks there is no need for sync.WaitGroup.

The results of the run are as follows.

2022/03/06 20:37:02 0
2022/03/06 20:37:02 2
2022/03/06 20:37:02 1
2022/03/06 20:37:03 3
2022/03/06 20:37:03 4
2022/03/06 20:37:03 5
2022/03/06 20:37:04 6
2022/03/06 20:37:04 7
2022/03/06 20:37:04 8
2022/03/06 20:37:05 9

It is easy to see from the log timestamps that only 3 tasks run concurrently in each second, which achieves the goal of controlling Goroutine concurrency.
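
As an illustration of the earlier point about HTTP services, here is a hedged sketch (the handler, port, and limit of 100 are hypothetical, not from the original) that uses a buffered channel as a semaphore to cap the number of requests processed at once, without sync.WaitGroup:

package main

import (
 "fmt"
 "net/http"
)

// sem allows at most 100 requests to be processed concurrently.
var sem = make(chan struct{}, 100)

func handler(w http.ResponseWriter, r *http.Request) {
 sem <- struct{}{}        // acquire a slot; blocks while 100 requests are in flight
 defer func() { <-sem }() // release the slot when the request is done
 // ... handle the request ...
 fmt.Fprintln(w, "ok")
}

func main() {
 http.HandleFunc("/", handler)
 http.ListenAndServe(":8080", nil)
}

Because the HTTP server already runs each request in its own Goroutine, the channel alone is enough to bound concurrency; there is nothing for sync.WaitGroup to wait for.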

3.2.3 Goroutine pooling

The above example simply limits the number of Goroutines that can be opened. On top of that, based on the idea of object reuse, we can reuse the opened Goroutines to avoid repeated creation and destruction of Goroutines and achieve the effect of pooling.

For Goroutine pooling, we could write a pool ourselves, but it is not recommended: mature open-source libraries already exist, and there is no need to reinvent the wheel. Many third-party libraries implement Goroutine pools and can easily be used to control the number of concurrent Goroutines; panjf2000/ants is one of the most popular.

The following is a brief introduction to its use, using panjf2000/ants as an example.

ants is an easy-to-use high-performance Goroutine pool that enables scheduling management and reuse of large-scale Goroutines, allowing users to limit the number of Goroutines while developing concurrent programs and reuse Goroutines to achieve more efficient task execution.

package main

import (
 "fmt"
 "time"

 "github.com/panjf2000/ants"
)

func main() {
 // Use the common pool
 for i := 0; i < 10; i++ {
  i := i
  ants.Submit(func() {
   fmt.Println(i)
  })
 }
 time.Sleep(time.Second)
}

With ants, we can simply use its default Goroutine pool and submit tasks directly for concurrent execution. The default pool has a capacity of math.MaxInt32.

If you want to customize the Goroutine pool size, you can call the NewPool method to instantiate a pool with the given capacity, as follows.

// Create a Goroutine pool with a capacity of 10000
p, _ := ants.NewPool(10000)
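
As a hedged usage sketch (the pool size of 100 and the task are arbitrary; it assumes the ants Pool methods Submit and Release), a custom pool can be used like this:

p, _ := ants.NewPool(100)
defer p.Release() // free the pool's resources when it is no longer needed

var wg sync.WaitGroup
for i := 0; i < 1000; i++ {
 i := i
 wg.Add(1)
 _ = p.Submit(func() {
  defer wg.Done()
  fmt.Println(i)
 })
}
wg.Wait()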

3.2.4 Summary

Go is built for concurrency, and Goroutines are lightweight threads managed by the Go runtime that make concurrent programming easy. But although Goroutines are lightweight, there is no free lunch: launching an unbounded number of Goroutines is bound to hurt performance and may even crash the program. Therefore, we should limit the number of Goroutines and, when necessary, reuse them.

3.3 Using sync.Once to avoid repeated executions

3.3.1 Introduction

sync.Once is provided by the Go standard library to ensure that a function is executed only once. It is often used for singletons, for example to initialize configuration or hold a single database connection. It is similar to the init function, but with some differences.

  • The init function is executed when the package is first loaded; if the result is never used later, this wastes memory and lengthens program startup time.
  • sync.Once can be called anywhere in the code, so initialization can be deferred until it is actually needed, and it is safe to use concurrently.

In most cases, sync.Once is used to control the initialization of a variable whose reads and writes satisfy the following three conditions.

  • initialization (writing) is performed when and only when a variable is accessed for the first time.
  • all reads are blocked during the initialization of the variable until the initialization is complete.
  • The variable is initialized only once and resides in memory after initialization is complete.
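
As a minimal, hypothetical sketch of this pattern (the client type and GetClient function are illustrative and assume "sync" is imported), a lazily initialized singleton looks like this:

type client struct {
 addr string
}

var (
 instance *client
 initOnce sync.Once
)

// GetClient initializes the singleton only on first access; concurrent callers
// block until the initialization finishes and then all observe the same value.
func GetClient() *client {
 initOnce.Do(func() {
  instance = &client{addr: "127.0.0.1:3306"}
 })
 return instance
}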

3.3.2 Principle

sync.Once is used to ensure that a function is executed only once. To achieve this, two things need to be done.

  • A counter, which counts the number of times the function has been executed.
  • Thread safety, guaranteeing that the function is still executed only once even when called from multiple Goroutines, for example by using a lock.

Source code.

Here's a look at the sync.Once structure, which has two member variables: done records whether the function has been executed, and m makes it safe for concurrent use. As it turns out, this matches the guess above.

// Once is an object that will perform exactly one action.
//
// A Once must not be copied after first use.
type Once struct {
 // done indicates whether the action has been performed.
 // It is first in the struct because it is used in the hot path.
 // The hot path is inlined at every call site.
 // Placing done first allows more compact instructions on some architectures (amd64/386),
 // and fewer instructions (to calculate offset) on other architectures.
 done uint32
 m    Mutex
}

sync.Once provides only one exported method, Do(), whose argument f is the function that will be executed only once, usually an object initialization function.

// go version go1.17 darwin/amd64

// Do calls the function f if and only if Do is being called for the
// first time for this instance of Once. In other words, given
//  var once Once
// if once.Do(f) is called multiple times, only the first call will invoke f,
// even if f has a different value in each invocation. A new instance of
// Once is required for each function to execute.
//
// Do is intended for initialization that must be run exactly once. Since f
// is niladic, it may be necessary to use a function literal to capture the
// arguments to a function to be invoked by Do:
//  config.once.Do(func() { config.init(filename) })
//
// Because no call to Do returns until the one call to f returns, if f causes
// Do to be called, it will deadlock.
//
// If f panics, Do considers it to have returned; future calls of Do return
// without calling f.
//
func (o *Once) Do(f func()) {
 // Note: Here is an incorrect implementation of Do:
 //
 // if atomic.CompareAndSwapUint32(&o.done, 0, 1) {
 //  f()
 // }
 //
 // Do guarantees that when it returns, f has finished.
 // This implementation would not implement that guarantee:
 // given two simultaneous calls, the winner of the cas would
 // call f, and the second would return immediately, without
 // waiting for the first's call to f to complete.
 // This is why the slow path falls back to a mutex, and why
 // the atomic.StoreUint32 must be delayed until after f returns.

 if atomic.LoadUint32(&o.done) == 0 {
  // Outlined slow-path to allow inlining of the fast-path.
  o.doSlow(f)
 }
}

func (o *Once) doSlow(f func()) {
 o.m.Lock()
 defer o.m.Unlock()
 if o.done == 0 {
  defer atomic.StoreUint32(&o.done, 1)
  f()
 }
}

Leaving aside the large comments, you can see that the sync.Once implementation is very clean. The Do() function decides whether to run the task function by checking the member variable done. Before running it, a lock ensures that executing the task function and modifying done form a mutually exclusive operation, and done is checked a second time under the lock, so the task function is executed, and done is modified, exactly once.

Why done is the first field.

A comment precedes the done field, explaining why done is the first field.

done is on the hot path, and placing it first reduces the number of CPU instructions needed to access it, i.e., it improves performance.

The hot path is a sequence of instructions that the program executes very frequently. sync.Once accesses o.done on almost every call, so that access clearly lies on the hot path. If the hot path compiles to fewer and more direct machine instructions, performance is bound to improve.

Why does putting a field first reduce instructions? Because the address of a struct's first field is the same as the address of the struct itself, accessing the first field only requires dereferencing the struct pointer. For any other field, an offset from the start of the struct must also be calculated. In machine code the offset is an extra operand carried with the instruction, and the CPU must add it to the pointer to obtain the address of the value being accessed. That is why the machine code that accesses the first field is more compact and faster.

https://stackoverflow.com/questions/59174176/what-does-hot-path-mean-in-the-context-of-sync-once

3.3.3 Performance Differences

Let’s take a simple example to illustrate the performance difference between using sync.Once to guarantee that a function will be executed only once and multiple times.

Consider a simple scenario where the function ReadConfig needs to read the environment variables and convert them to the corresponding configuration. ReadConfig may be called concurrently by multiple Goroutines, and to improve performance (reduce execution time and memory usage), using sync.Once is a better way.

type Config struct {
 GoRoot string
 GoPath string
}

var (
 once   sync.Once
 config *Config
)

func ReadConfigWithOnce() *Config {
 once.Do(func() {
  config = &Config{
   GoRoot: os.Getenv("GOROOT"),
   GoPath: os.Getenv("GOPATH"),
  }
 })
 return config
}

func ReadConfig() *Config {
 return &Config{
  GoRoot: os.Getenv("GOROOT"),
  GoPath: os.Getenv("GOPATH"),
 }
}

Let’s look at the performance differences between the two.

func BenchmarkReadConfigWithOnce(b *testing.B) {
 for i := 0; i < b.N; i++ {
  _ = ReadConfigWithOnce()
 }
}

func BenchmarkReadConfig(b *testing.B) {
 for i := 0; i < b.N; i++ {
  _ = ReadConfig()
 }
}

The results of the execution tests are as follows.

go test -bench=. main/once
goos: darwin
goarch: amd64
pkg: main/once
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkReadConfigWithOnce-12          670438965                1.732 ns/op
BenchmarkReadConfig-12                  13339154                87.46 ns/op
PASS
ok      main/once       3.006s

sync.Once ensures that the Config initialization function is executed only once, avoiding repeated initializations, which is useful in concurrent environments.

3.4 Notifying Goroutine with sync.Cond

3.4.1 Introduction

sync.Cond is a condition variable built on top of a mutex or read-write lock, used to coordinate Goroutines that want to access a shared resource. When the state of the shared resource changes, sync.Cond can notify the Goroutines that are blocked waiting for that condition.

sync.Cond is built on a mutex/read-write lock, so what is the difference between it and a plain lock?

A mutex such as sync.Mutex is usually used to protect a shared critical resource, while the condition variable sync.Cond is used to coordinate the Goroutines that want to access it: when the state of the shared resource changes, sync.Cond can notify the blocked Goroutines.

3.4.2 Usage Scenarios

sync.Cond is often used in scenarios where multiple Goroutines are waiting and one Goroutine is notified (event occurs). If it’s one notification and one wait, using a mutex lock or channel will do the trick.

Let’s imagine a very simple scenario.

One Goroutine is receiving data asynchronously, and all the other Goroutines must wait until it has finished before they can read the correct data. In this case, with only a channel or a mutex, a single Goroutine can wait and read the data, but there is no way to notify the other Goroutines to read it as well.

In that situation we would need a global variable marking whether the first Goroutine has finished receiving the data, with the remaining Goroutines polling it repeatedly until it is set; or we would create one channel per Goroutine, block each Goroutine on its own channel, and have the receiving Goroutine notify them one by one once the data arrives. Either way, extra complexity is required to get this done.

Go has a built-in sync.Cond in the standard library sync to solve this kind of problem.

3.4.3 Principle

sync.Cond internally maintains a wait queue, i.e. a notification list, of all the Goroutines waiting on it. It can wake one or all of the Goroutines blocked on the condition variable, thereby synchronizing multiple Goroutines.

The definition of sync.Cond is as follows.

// Cond implements a condition variable, a rendezvous point
// for goroutines waiting for or announcing the occurrence
// of an event.
//
// Each Cond has an associated Locker L (often a *Mutex or *RWMutex),
// which must be held when changing the condition and
// when calling the Wait method.
//
// A Cond must not be copied after first use.
type Cond struct {
 noCopy noCopy

 // L is held while observing or changing the condition
 L Locker

 notify  notifyList
 checker copyChecker
}

Each Cond instance is associated with a lock L (mutex lock *Mutex, or read/write lock *RWMutex) that must be locked when modifying conditions or calling the Wait method.

The four member functions of sync.Cond are defined as follows.

// NewCond returns a new Cond with Locker l.
func NewCond(l Locker) *Cond {
 return &Cond{L: l}
}

When NewCond creates a Cond instance, a lock needs to be associated with it.

// Wait atomically unlocks c.L and suspends execution
// of the calling goroutine. After later resuming execution,
// Wait locks c.L before returning. Unlike in other systems,
// Wait cannot return unless awoken by Broadcast or Signal.
//
// Because c.L is not locked when Wait first resumes, the caller
// typically cannot assume that the condition is true when
// Wait returns. Instead, the caller should Wait in a loop:
//
//    c.L.Lock()
//    for !condition() {
//        c.Wait()
//    }
//    ... make use of condition ...
//    c.L.Unlock()
//
func (c *Cond) Wait() {
 c.checker.check()
 t := runtime_notifyListAdd(&c.notify)
 c.L.Unlock()
 runtime_notifyListWait(&c.notify, t)
 c.L.Lock()
}

Wait is used to block the caller and wait for a notification. Calling Wait automatically releases the lock c.L and suspends the calling Goroutine. If another Goroutine calls Signal or Broadcast to wake it, Wait re-acquires c.L before returning, and execution continues with the code after Wait.

The reason a for !condition() loop is used instead of a plain if is that the condition may still not be satisfied when the Goroutine is woken, so it must wait again for the next wakeup. To be on the safe side, using for ensures that the condition holds before the subsequent code runs.

// Signal wakes one goroutine waiting on c, if there is any.
//
// It is allowed but not required for the caller to hold c.L
// during the call.
func (c *Cond) Signal() {
 c.checker.check()
 runtime_notifyListNotifyOne(&c.notify)
}

// Broadcast wakes all goroutines waiting on c.
//
// It is allowed but not required for the caller to hold c.L
// during the call.
func (c *Cond) Broadcast() {
 c.checker.check()
 runtime_notifyListNotifyAll(&c.notify)
}

Signal wakes up only one arbitrary Goroutine waiting on condition variable c, and the caller is not required to hold the lock. Broadcast wakes up all Goroutines waiting on c, and likewise does not require holding the lock.

3.4.4 Usage Example

We implement a simple example where three goroutines call Wait() to wait and another goroutine calls Broadcast() to wake up all waiting goroutines.

var done = false

func read(name string, c *sync.Cond) {
 c.L.Lock()
 for !done {
  c.Wait()
 }
 log.Println(name, "starts reading")
 c.L.Unlock()
}

func write(name string, c *sync.Cond) {
 log.Println(name, "starts writing")
 time.Sleep(time.Second)
 // Hold the lock while changing the condition to avoid a data race with read().
 c.L.Lock()
 done = true
 c.L.Unlock()
 log.Println(name, "wakes all")
 c.Broadcast()
}

func main() {
 cond := sync.NewCond(&sync.Mutex{})

 go read("reader1", cond)
 go read("reader2", cond)
 go read("reader3", cond)
 write("writer", cond)

 time.Sleep(time.Second * 3)
}

  • done is the condition that the blocked Goroutines wait on.
  • read() calls Wait() to wait for notification until done is true.
  • write() receives the data, and when it is done, sets done to true and calls Broadcast() to notify all waiting goroutines.
  • write() sleeps for 1s both to simulate time-consuming work and to make sure the three read Goroutines have all reached Wait() and are blocked. The main function sleeps for 3s at the end to give all operations time to finish.

Output.

go run main.go
2022/03/07 17:20:09 writer starts writing
2022/03/07 17:20:10 writer wakes all
2022/03/07 17:20:10 reader3 starts reading
2022/03/07 17:20:10 reader1 starts reading
2022/03/07 17:20:10 reader2 starts reading

More discussion of sync.Cond can be found at: https://stackoverflow.com/questions/36857167/how-to-correctly-use-sync-cond

3.4.5 Caution

  1. sync.Cond cannot be copied

    The reason sync.Cond cannot be copied is not the Locker nested inside it: the Locker passed to NewCond is a *Mutex or *RWMutex pointer, and copying a pointer is not a problem.

    The main reason is that sync.Cond maintains an internal notification queue of waiting Goroutines, notifyList. If a Cond is copied, then in a concurrent scenario different Goroutines operate on different copies, so notifyList.wait and notifyList.notify are no longer consistent, which can leave some Goroutines blocked forever.

  2. Wakeup order

    Goroutines are woken from the wait queue in order: the first to enter the wait queue is the first to be woken.

  3. Lock before calling Wait()

    Before calling Wait(), the condition variable's associated lock must be acquired, because the condition variable's wait queue must be modified under mutual exclusion. The lock is re-acquired before Wait() returns (see the sketch after this list).
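
To tie these points together, here is a hedged sketch (the bounded queue type, its method names, and the capacity are hypothetical and assume "sync" is imported) that locks before Wait(), re-checks its condition in a for loop, and uses Signal to wake waiters one at a time:

type queue struct {
 mu       sync.Mutex
 notFull  *sync.Cond
 notEmpty *sync.Cond
 items    []int
 cap      int
}

func newQueue(capacity int) *queue {
 q := &queue{cap: capacity}
 q.notFull = sync.NewCond(&q.mu)
 q.notEmpty = sync.NewCond(&q.mu)
 return q
}

// Put blocks while the queue is full.
func (q *queue) Put(v int) {
 q.mu.Lock()
 for len(q.items) == q.cap { // re-check the condition after every wakeup
  q.notFull.Wait()
 }
 q.items = append(q.items, v)
 q.mu.Unlock()
 q.notEmpty.Signal() // wake one Goroutine blocked in Get
}

// Get blocks while the queue is empty.
func (q *queue) Get() int {
 q.mu.Lock()
 for len(q.items) == 0 {
  q.notEmpty.Wait()
 }
 v := q.items[0]
 q.items = q.items[1:]
 q.mu.Unlock()
 q.notFull.Signal() // wake one Goroutine blocked in Put
 return v
}

The two condition variables share one mutex, so Put only wakes Goroutines blocked in Get and vice versa, and the for loops guarantee the condition still holds after each wakeup.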