Today I saw a piece of code that exposes metrics with Prometheus' client_golang. Instead of simply calling Inc() on the corresponding counter, it implements a rather strange logic of its own:

  1. When the program needs to increment a counter, it does not operate on the metric directly. Instead, it packages the increment in its own format and sends the object to a channel; each metric has its own channel.
  2. At startup, the program launches a single global worker goroutine responsible for all metrics: it receives messages from the different channels, unpacks them, finds the metric that should be incremented, and then performs the final addition.

The actual implementation is more complicated still: it first creates a MetricsBuilder, whose Add() function sends a message to the channel; the message is then read out and, through a series of cascading calls, finally increments the metric.

It feels like a one-line metrics.Add() job, so why make it so complicated? After some thought, the only plausible explanation I could come up with is that this is an extremely loaded system, and the author wanted to make metric updates asynchronous so they don't eat into business-processing time. But going through a channel also involves packing and unpacking, so is it really faster?

At first I thought a channel might be a high-performance lock-free structure, but after reading the channel implementation in the Go runtime, I found that it uses a lock too, and multiple goroutines writing to the same channel contend for that lock.

Meanwhile, Prometheus' client_golang just performs a single atomic add for the integer path: atomic.AddUint64(&c.valInt, ival).

Although atomic operations have their own cost, intuitively I don't see how going through a channel could be faster than a single atomic add.

I wrote two pieces of code to compare the two approaches (the test code and instructions for running it are in the atomic_or_channel repository).

Directly with atomic:

func AtomicAdd() uint64 {
    var wg sync.WaitGroup
    var count uint64 // zero value is 0; no explicit reset needed

    for i := 1; i <= CLIENTS; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for l := 0; l < LOOP; l++ {
                atomic.AddUint64(&count, 1)
            }
        }()
    }

    wg.Wait()
    return count
}

The simulated channel-based version, with a single worker responsible for the adding:

func ChannelAdd() uint64 {
    var wg sync.WaitGroup
    var count uint64

    numCh := make(chan *uint64, 10240)
    // start the single worker that performs the actual addition
    go func() {
        for value := range numCh {
            atomic.AddUint64(&count, *value)
        }
    }()

    one := uint64(1)
    for i := 1; i <= CLIENTS; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for l := 0; l < LOOP; l++ {
                numCh <- &one
            }
        }()
    }

    wg.Wait()
    close(numCh)
    // note: values still buffered in numCh at this point are never
    // counted, so the result can fall slightly short of CLIENTS*LOOP
    return count
}

The parameters are as follows, intended to simulate 100 parallel clients, each incrementing the counter 1 million times.

var LOOP = 1000000
var CLIENTS = 100

The actual results matched my intuition: atomic is more than 15 times faster than the channel approach.

➜ go run .
ChannelAdd() Done! count=99991670 took 33.95414173s
AtomicAdd() Done! count=100000000 took 1.92393346s
➜ go test -bench=. -count=10
goos: darwin
goarch: amd64
pkg: github.com/laixintao/atomic_or_channel
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkAtomic-12             1        1947990739 ns/op
BenchmarkAtomic-12             1        1957558508 ns/op
BenchmarkAtomic-12             1        1984851549 ns/op
BenchmarkAtomic-12             1        1957678020 ns/op
BenchmarkAtomic-12             1        1953746756 ns/op
BenchmarkAtomic-12             1        1957197155 ns/op
BenchmarkAtomic-12             1        1964173594 ns/op
BenchmarkAtomic-12             1        1962058130 ns/op
BenchmarkAtomic-12             1        2342306683 ns/op
BenchmarkAtomic-12             1        2313845157 ns/op
BenchmarkChannel-12            1        35211801199 ns/op
BenchmarkChannel-12            1        40597364557 ns/op
BenchmarkChannel-12            1        38452709531 ns/op
BenchmarkChannel-12            1        40201893971 ns/op
BenchmarkChannel-12            1        41802617846 ns/op
BenchmarkChannel-12            1        41463031707 ns/op
BenchmarkChannel-12            1        41985476702 ns/op
BenchmarkChannel-12            1        43106978329 ns/op
BenchmarkChannel-12            1        45582670783 ns/op
BenchmarkChannel-12            1        43751655673 ns/op
PASS
ok      github.com/laixintao/atomic_or_channel  432.885s

With atomic, 100 parallel clients each adding 1 million times finish in about 2s, i.e. roughly 50 million increments per second (and that's just on my laptop). The same workload takes 30-40s with the channel approach described above, more than 15 times slower.

Although I have argued elsewhere that atomic operations are slow, this throughput is more than adequate for a metrics-counting scenario like this one.