It is well known that Go’s map is not safe for concurrent use. If more than one goroutine reads and writes the same map at the same time, the runtime reports an error. I’m ashamed to say that I always assumed this error could be caught with recover, until yesterday, when a colleague raised the question and sent me some demo code. Today I took the time to look into it, organize what I learned, and share it with you.

Why single out this issue? It starts with the design of the Sniper framework (https://taoshu.in/go/go-sniper.html). Our online system uses a microservices architecture, but the code lives in a monorepo: all business lines share one codebase, and only one binary is compiled. The framework starts a goroutine for each request and catches all errors and panics with recover. If the business code hits a failure that cannot be caught, the whole process exits, which affects every request being processed. The consequences are very serious and totally unacceptable.

There are three types of errors in Go: error, panic and fatal error.

  • Error is what we usually mean by an error. It is typically passed through a function’s return value and handled with if err != nil.
  • Panic is what we sometimes call an exception, comparable to exceptions in other languages. A panic is triggered by out-of-range array access, nil pointer dereferences, etc., and can also be triggered by business code. This type of error can be caught using recover.
  • Fatal error is a serious error triggered by the system, which is usually related to system resources. A typical fatal error is a failure to request memory from the system. It is severe because the program cannot recover from such errors.
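The first two categories can be sketched in a few lines. This is a minimal illustration, not code from the Sniper framework; the names divide, safeIndex and errDivideByZero are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errDivideByZero is an ordinary error value, returned to the caller.
var errDivideByZero = errors.New("divide by zero")

// divide reports failure through its return value (category 1: error).
func divide(a, b int) (int, error) {
	if b == 0 {
		return 0, errDivideByZero
	}
	return a / b, nil
}

// safeIndex reads s[i] and converts an out-of-range panic into an error
// (category 2: panic, caught with recover inside a deferred function).
func safeIndex(s []int, i int) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return s[i], nil
}

func main() {
	if _, err := divide(1, 0); err != nil {
		fmt.Println("error:", err)
	}
	if _, err := safeIndex([]int{1, 2, 3}, 10); err != nil {
		fmt.Println("panic turned into error:", err)
	}
	// Category 3 (fatal error) has no counterpart here: recover never sees
	// it, so no deferred function can turn it into an error value.
}
```

The deferred function in safeIndex is the same pattern a framework uses at the top of each request goroutine. It only helps with the second category.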

Because fatal errors are unrecoverable, when one occurs the entire process exits, which in turn affects all in-flight requests. Concurrent map reads and writes, as discussed above, are in fact a fatal error as well.
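Since recover cannot help here, the only defense is to make sure the map is never accessed concurrently in the first place. One common way is to put the map behind a sync.RWMutex. The sketch below assumes a simple counter use case; counterMap, incr, get and hammer are hypothetical names, not part of any framework.

```go
package main

import (
	"fmt"
	"sync"
)

// counterMap is a hypothetical wrapper: every access to the inner map
// goes through mu, so the runtime never observes a concurrent read/write.
type counterMap struct {
	mu sync.RWMutex
	m  map[string]int
}

func newCounterMap() *counterMap {
	return &counterMap{m: make(map[string]int)}
}

// incr takes the write lock because it modifies the map.
func (c *counterMap) incr(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key]++
}

// get takes the read lock, so many readers may proceed at once.
func (c *counterMap) get(key string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.m[key]
}

// hammer starts workers goroutines; each increments and reads "hits" n times.
func hammer(c *counterMap, workers, n int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < n; i++ {
				c.incr("hits")
				_ = c.get("hits")
			}
		}()
	}
	wg.Wait()
}

func main() {
	c := newCounterMap()
	hammer(c, 8, 1000)
	fmt.Println(c.get("hits")) // 8000
}
```

Running the same loop against a bare map[string]int would very likely end with a fatal error instead of printing a count.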

Looking into the history, we found that before Go 1.6, concurrent reads and writes of a map did not immediately report an error. Only when they led to a fault such as a nil pointer dereference would a panic be triggered, with output like the following.

unexpected fault address 0x0
fatal error: fault
[signal 0x7 code=0x80 addr=0x0 pc=0x40873b]

goroutine 97699 [running]:
runtime.throw(0x17f5cc0, 0x5)
    /usr/local/go/src/runtime/panic.go:527
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:21
runtime.mapassign1(0x12c6fe0, 0xc88283b998, 0xc8c9b63c68, 0xc8c9b63cd8)
    /usr/local/go/src/runtime/hashmap.go:446

A panic of this kind, such as a nil pointer dereference, can be caught with recover. However, **this error is sporadic, difficult to reproduce, and not conducive to locating the problem quickly**. That’s why feedback on this issue has been particularly frequent in the community. To solve this problem, Go 1.6 introduced a commit that detects the misuse directly. The idea is simple and blunt: when a goroutine updates the map, it sets a write flag; if another goroutine accesses the map and finds the write flag set, the runtime reports a fatal error right away.

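	// Write path: refuse to start if another write is already in progress.
	if h.flags&hashWriting != 0 {
		throw("concurrent map writes")
	}
	h.flags |= hashWriting
	// ...
	// Read path: refuse to read while a write is in progress.
	if h.flags&hashWriting != 0 {
		throw("concurrent map read and map write")
	}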

In fact, I do not understand why this has to be a fatal error. Since the misuse is actively detected, it will not produce dirty data. It would be perfectly possible to raise a normal panic for concurrent reads and writes, so that the framework could recover and a local error would not take down the whole process.
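To make the flag check and the panic-versus-throw distinction concrete, here is a toy model. It is only a sketch under loud assumptions: toyMap is a made-up type, not the runtime’s real map header, and it deliberately uses panic where the runtime uses throw, which is the alternative argued for above and not how Go actually behaves. The "concurrent" write is simulated by setting the flag by hand, so the example is deterministic; toyMap itself is not safe for real concurrent use.

```go
package main

import "fmt"

// toyMap is a hypothetical stand-in for the runtime's map header.
// writing plays the role of the hashWriting bit in h.flags.
type toyMap struct {
	writing bool
	data    map[string]int
}

func newToyMap() *toyMap {
	return &toyMap{data: make(map[string]int)}
}

// set mirrors the write path: check the flag, set it, write, clear it.
func (m *toyMap) set(k string, v int) {
	if m.writing {
		panic("concurrent map writes")
	}
	m.writing = true
	m.data[k] = v
	m.writing = false
}

// get mirrors the read path: refuse to read while a write is marked.
func (m *toyMap) get(k string) int {
	if m.writing {
		panic("concurrent map read and map write")
	}
	return m.data[k]
}

// try runs fn and reports any panic as an error, the way a framework's
// deferred recover would.
func try(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	m := newToyMap()
	m.set("a", 1)

	// Pretend another goroutine is in the middle of a write.
	m.writing = true
	err := try(func() { m.get("a") })
	fmt.Println(err) // recovered: concurrent map read and map write
}
```

With panic, the caller gets an error back and the process keeps running. With the runtime’s throw, the same detection ends the process, and try would never return.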

Although I was not aware of this behavior, our online system has never run into the problem. The main reason is that, as a rule, we don’t use goroutines to write business logic. Where one is unavoidable, we provide encapsulated methods at the framework level so that nobody uses the go statement directly to start a goroutine. That’s why goroutines rarely appear in our business code.
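A framework-level wrapper of this kind might look roughly like the following. This is a guess at the general shape, not Sniper’s actual API: goSafe and runTasks are hypothetical names, and a real framework would log the panic and its stack trace instead of collecting strings.

```go
package main

import (
	"fmt"
	"sync"
)

// goSafe is a hypothetical helper: it runs fn in a new goroutine and hands
// any recovered panic value to onPanic instead of letting it kill the process.
func goSafe(wg *sync.WaitGroup, fn func(), onPanic func(r interface{})) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() {
			if r := recover(); r != nil {
				onPanic(r)
			}
		}()
		fn()
	}()
}

// runTasks starts every task with goSafe, waits for all of them, and
// returns the recovered panic values as strings.
func runTasks(tasks ...func()) []string {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		got []string
	)
	for _, t := range tasks {
		goSafe(&wg, t, func(r interface{}) {
			mu.Lock()
			defer mu.Unlock()
			got = append(got, fmt.Sprint(r))
		})
	}
	wg.Wait()
	return got
}

func main() {
	recovered := runTasks(
		func() {},                // finishes normally
		func() { panic("boom") }, // explicit panic from business code
		func() { // write to a nil map: an ordinary, recoverable panic
			var m map[string]int
			m["x"] = 1
		},
	)
	fmt.Println(len(recovered)) // 2
}
```

Such a wrapper only covers panics. If the function passed in shares a map with another goroutine without synchronization, the resulting fatal error still bypasses the deferred recover.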

Concurrent programming is hard. Don’t assume that Go’s built-in goroutines make concurrent programming simple; they only lower the barrier to entry. So be careful when deciding to use goroutines, and use them sparingly. We want to write code that obviously has no bugs, not code that has no obvious bugs.