Exploring the pitfalls of panic & recover through the Go source code

Preface

I wrote this article because a colleague recently started a goroutine directly with the go keyword and then hit a nil pointer problem. Because that goroutine had no recover, the whole program went down. The code looked like this.

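package main

import (
    "fmt"
    "time"
)

func main() {
    // This deferred recover belongs to the main goroutine only.
    defer func() {
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    go func() {
        fmt.Println("======begin work======")
        panic("nil pointer exception") // raised in a different goroutine
    }()
    time.Sleep(time.Second * 100)
    fmt.Println("======after work======")
}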

Output:

======begin work======
panic: nil pointer exception

goroutine 18 [running]:
...
Process finished with the exit code 2

Note that main installs a catch-all handler with defer and recover, but that deferred function clearly does not catch the panic raised inside the goroutine.

The reason is visible in the Go source code: panic and recover have a well-defined scope.

recover only takes effect when it is called directly inside a deferred function.
panic can be called again inside a deferred function, so panics can nest.
panic only runs the deferred functions of the current goroutine.

The reason panic only runs the current goroutine’s deferred functions is that when newdefer allocates a _defer structure, it links the new object to the head of the current goroutine’s _defer list. A panic walks only that list, so a defer registered in main is never seen by a panic raised in another goroutine.
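One way to handle the situation from the preface is to install the recover inside the goroutine itself. The sketch below is only an illustration: safeGo and runWorker are hypothetical helper names, not part of the standard library, and a real program would probably log the recovered value rather than store it.

```go
package main

import (
	"fmt"
	"sync"
)

// safeGo is a hypothetical helper: it starts fn in a new goroutine and
// installs the recover inside that goroutine, where it can see the panic.
func safeGo(wg *sync.WaitGroup, fn func(), onPanic func(r interface{})) {
	wg.Add(1)
	go func() {
		defer wg.Done() // registered first, so it runs last
		defer func() {
			if r := recover(); r != nil {
				onPanic(r)
			}
		}()
		fn()
	}()
}

// runWorker runs a panicking task through safeGo and returns what was recovered.
func runWorker() interface{} {
	var wg sync.WaitGroup
	var recovered interface{}
	safeGo(&wg, func() {
		panic("nil pointer exception")
	}, func(r interface{}) {
		recovered = r
	})
	wg.Wait()
	return recovered
}

func main() {
	fmt.Println("recovered:", runWorker())
	fmt.Println("======after work======")
}
```

With the recover living on the goroutine’s own _defer list, the panic no longer takes down the process and main keeps running.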

Source code analysis

_panic struct

type _panic struct {
    argp      unsafe.Pointer // pointer to arguments of deferred call run during panic; cannot move - known to liblink
    arg       interface{}    // argument to panic
    link      *_panic        // link to earlier panic
    pc        uintptr        // where to return to in runtime if this panic is bypassed
    sp        unsafe.Pointer // where to return to in runtime if this panic is bypassed
    recovered bool           // whether this panic is over
    aborted   bool           // the panic was aborted
    goexit    bool
}

argp points to the arguments of the deferred call that is running during the panic.
arg is the value passed to panic.
link points to the earlier runtime._panic structure; panic can be called repeatedly, and the structures form a linked list.
recovered indicates whether the current runtime._panic has been recovered.
aborted indicates whether the current panic has been forcibly terminated.
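The effect of the link chain can be observed from ordinary code. In the minimal sketch below (nestedPanic is a made-up name), a deferred function panics while an earlier panic is still in flight, and the outermost deferred function then calls recover, which returns the value of the most recent panic.

```go
package main

import "fmt"

// nestedPanic panics, then panics again from a deferred call while the first
// panic is still in flight. The outermost deferred function recovers.
func nestedPanic() (recovered interface{}) {
	defer func() {
		recovered = recover() // sees the most recent panic
	}()
	defer func() {
		panic("second panic")
	}()
	panic("first panic")
}

func main() {
	fmt.Println(nestedPanic())
}
```

If nothing recovered here, the crash report would list both panics, which is what the chain is kept for.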

The three fields pc, sp, and goexit exist for one case: a panic is raised in a deferred function while runtime.Goexit is running, and a later deferred function recovers it. Without these fields, the recovered goroutine would resume normal execution and thereby cancel the Goexit. With them, the runtime can return to the Goexit logic and finish exiting the goroutine.

The discussion of the pc, sp, and goexit fields and the commit that added them can be found at https://github.com/golang/go/commit/7dcd343ed641d3b70c09153d3b041ca3fe83b25e and in the issue “runtime: panic + recover can cancel a call to Goexit”.
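The Goexit case can be reproduced with a small program. This is a sketch that assumes a Go release containing the commit above; reachedAfterGoexit is a hypothetical name. A deferred function panics while Goexit is running, another deferred function recovers that panic, and the function reports whether the code after the Goexit call was reached.

```go
package main

import (
	"fmt"
	"runtime"
)

// reachedAfterGoexit reports whether the statement following the Goexit
// call was executed after a panic raised during Goexit was recovered.
func reachedAfterGoexit() (reached bool) {
	done := make(chan struct{})
	go func() {
		defer close(done) // Goexit still runs every deferred call
		func() {
			defer func() { recover() }()                         // recovers the panic below
			defer func() { panic("panic in a deferred call") }() // runs first, during Goexit
			runtime.Goexit()
		}()
		reached = true // would run only if recover had cancelled Goexit
	}()
	<-done
	return reached
}

func main() {
	fmt.Println("code after Goexit reached:", reachedAfterGoexit())
}
```

On such a release the recover stops the panic, but the goroutine still exits, so the assignment after the inner function is skipped.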

panic process

The compiler converts a call to panic into a call to runtime.gopanic, which then loops over the current goroutine’s _defer list, fetching and running the deferred functions one by one.
If a deferred function calls recover, runtime.gorecover is invoked and sets the recovered field of runtime._panic to true.
When the deferred function returns to the main logic of runtime.gopanic and the recovered field is true, gopanic takes the program counter pc and stack pointer sp from the runtime._defer structure and calls runtime.recovery to resume the program. runtime.recovery sets the function’s return value to 1 when it reschedules the goroutine.
When runtime.deferproc appears to return 1, the compiler-generated code jumps straight to the end of the calling function and runs runtime.deferreturn. At that point the program has recovered from the panic and carries on normally.
If runtime.gopanic runs every _defer without encountering a recover, it calls runtime.fatalpanic, which terminates the program with exit code 2.

So the whole process has two branches: with a recover, the panic is recovered; without one, the panic crashes the program.
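The recover branch described above can be seen from user code. The sketch below (runSteps and explode are illustrative names) records the order of events: deferred functions run in last-in, first-out order, the statement after the panicking call is skipped, and the function returns to its caller once a deferred function has recovered.

```go
package main

import "fmt"

// explode stands in for any call that panics.
func explode() {
	panic("boom")
}

// runSteps records what happens when a function panics and one of its
// deferred calls recovers.
func runSteps() (steps []string) {
	defer func() {
		if r := recover(); r != nil {
			steps = append(steps, "recovered: "+fmt.Sprint(r))
		}
	}()
	defer func() { steps = append(steps, "second defer") }()

	steps = append(steps, "before panic")
	explode()
	steps = append(steps, "after panic") // skipped: the panic unwinds past this line
	return steps
}

func main() {
	for _, s := range runSteps() {
		fmt.Println(s)
	}
	fmt.Println("caller continues")
}
```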

Trigger panic to crash directly

func gopanic(e interface{}) {
    gp := getg()
    ...
    var p _panic
    // Create a new runtime._panic and push it onto the front of the goroutine's _panic list.
    p.link = gp._panic
    gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

    for {
        // Fetch the current goroutine's next deferred call.
        d := gp._defer
        if d == nil {
            break
        }
        ...
        d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))
        // Run the deferred function.
        reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz), uint32(d.siz), &regs)
        d._panic = nil
        d.fn = nil
        gp._defer = d.link
        // Remove the defer from the current goroutine.
        freedefer(d)
        // A deferred call invoked recover: resume the program.
        if p.recovered {
            ...
        }
    }
    // Prepare every panic message, and the argument passed to each panic call, for printing.
    preprintpanics(gp._panic)
    // fatalpanic implements the unrecoverable crash.
    fatalpanic(gp._panic)
    *(*int)(nil) = 0 // not reached
}

Let’s look at the logic first.

It first fetches the current goroutine, creates a new runtime._panic, and adds it to the front of that goroutine’s _panic list.
It then enters a loop that walks the current goroutine’s _defer list and calls reflectcall to run each deferred function.
After running a deferred function, it removes that defer from the current goroutine. Since we assume here that there is no recover, the loop runs to the end and fatalpanic is called to stop the whole program.

func fatalpanic(msgs *_panic) {
    pc := getcallerpc()
    sp := getcallersp()
    gp := getg()
    var docrash bool 
    systemstack(func() {
        if startpanic_m() && msgs != nil { 
            printpanics(msgs)
        }

        docrash = dopanic_m(gp, pc, sp)
    })
    if docrash {
        crash()
    } 
    systemstack(func() {
        exit(2)
    })
    *(*int)(nil) = 0 // not reached
}

Before aborting the program, fatalpanic prints the full panic messages, including the arguments passed to panic, via printpanics. It then calls exit with exit code 2.

Triggering a panic recovery

The compiler translates a call to recover into a call to runtime.gorecover.

func gorecover(argp uintptr) interface{} { 
    gp := getg()
    p := gp._panic
    if p != nil && !p.goexit && !p.recovered && argp == uintptr(p.argp) {
        p.recovered = true
        return p.arg
    }
    return nil
}

If the current goroutine is not panicking, the function simply returns nil. The p.goexit check tests whether the current entry was created by runtime.Goexit; as noted above, recover cannot stop a Goexit. The argp comparison ensures that recover was called directly by the deferred function that the panic is running.

If the conditions are met, the recovered field is set to true, and the actual recovery is then performed in runtime.gopanic.
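The requirement that recover be called directly by the deferred function is easy to trip over. In this small sketch (nestedRecover is an illustrative name), one recover is called from a helper closure inside a deferred function and returns nil, while the other is called directly and stops the panic.

```go
package main

import "fmt"

// nestedRecover calls recover at two depths while the same panic is in flight.
func nestedRecover() (inner, outer interface{}) {
	defer func() {
		outer = recover() // called directly by the deferred function: stops the panic
	}()
	defer func() {
		func() {
			inner = recover() // one frame too deep: returns nil, panic continues
		}()
	}()
	panic("boom")
}

func main() {
	inner, outer := nestedRecover()
	fmt.Println("inner:", inner)
	fmt.Println("outer:", outer)
}
```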

func gopanic(e interface{}) {
    gp := getg()
    ...
    var p _panic
    // Create a new runtime._panic and push it onto the front of the goroutine's _panic list.
    p.link = gp._panic
    gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

    for {
        // Fetch the current goroutine's next deferred call.
        d := gp._defer
        ...
        pc := d.pc
        sp := unsafe.Pointer(d.sp)
        // A deferred call invoked recover: resume the program.
        if p.recovered {
            // Move on to the next panic in the chain.
            gp._panic = p.link
            // If that panic was created by Goexit, return to the Goexit code so that it can finish exiting.
            if gp._panic != nil && gp._panic.goexit && gp._panic.aborted {
                gp.sigcode0 = uintptr(gp._panic.sp)
                gp.sigcode1 = uintptr(gp._panic.pc)
                mcall(recovery)
                throw("bypassed recovery failed") // mcall resumes normal execution, so this is never reached
            }
            ...

            gp._panic = p.link
            for gp._panic != nil && gp._panic.aborted {
                gp._panic = gp._panic.link
            }
            if gp._panic == nil {
                gp.sig = 0
            }
            gp.sigcode0 = uintptr(sp)
            gp.sigcode1 = pc
            mcall(recovery)
            throw("recovery failed") // mcall resumes normal execution, so this is never reached
        }
    } 
    ...
}

This contains two mcall(recovery) calls.

The first, guarded by if gp._panic != nil && gp._panic.goexit && gp._panic.aborted, handles Goexit: it makes sure that a recovered panic returns to the Goexit logic, so the goroutine still exits.

The second performs the actual panic recovery: it takes the program counter pc and the stack pointer sp from runtime._defer and calls recovery to resume the program.

func recovery(gp *g) { 
    sp := gp.sigcode0
    pc := gp.sigcode1
    ...
    gp.sched.sp = sp
    gp.sched.pc = pc
    gp.sched.lr = 0
    gp.sched.ret = 1
    gogo(&gp.sched)
}

recovery sets the function’s return value to 1, and the call to gogo jumps back to the point where the defer statement was registered (just after the deferproc call), where the goroutine continues executing.

func deferproc(siz int32, fn *funcval) {  
    ...
    // deferproc returns 0 normally.
    // a deferred func that stops a panic
    // makes the deferproc return 1.
    // the code the compiler generates always
    // checks the return value and jumps to the
    // end of the function if deferproc returns != 0.
    return0() 
}

From the comments we know that when deferproc returns 1, the compiler-generated code jumps straight to the end of the calling function and executes runtime.deferreturn.

What are the pitfalls in runtime?

We may not recommend using panic in business code, but that does not mean the runtime avoids it, and this is a big trap for newcomers who do not know Go’s underlying implementation. It is hard to write robust Go code without knowing these pitfalls.

Here I’ll categorise the errors raised by the runtime: some cannot be caught by recover, and some are ordinary panics that can be caught.

Uncatchable exceptions

Out of memory

func main() {
    defer errorHandler()
    _ = make([]int64, 1<<40)
    fmt.Println("can recover")
}

func errorHandler() {
    if r := recover(); r != nil {
        fmt.Println(r)
    }
}

When memory is allocated, alloc calls grow to request new memory from the operating system. If the mmap call that requests the memory returns _ENOMEM, the runtime calls throw("runtime: out of memory"), and throw calls exit, so the whole program exits.

func sysMap(v unsafe.Pointer, n uintptr, sysStat *sysMemStat) {
    sysStat.add(int64(n))

    p, err := mmap(v, n, _PROT_READ|_PROT_WRITE, _MAP_ANON|_MAP_FIXED|_MAP_PRIVATE, -1, 0)
    if err == _ENOMEM {
        throw("runtime: out of memory")
    }
    if p != v || err != 0 {
        throw("runtime: cannot map pages in arena address space")
    }
}

func throw(s string) {
    ...
    fatalthrow()
    *(*int)(nil) = 0 // not reached
}

func fatalthrow() { 
    systemstack(func() { 
        ...
        exit(2)
    })

}

Concurrent map read and write

func main() {
    defer errorHandler()
    m := map[string]int{}

    go func() {
        for {
            m["x"] = 1
        }
    }()
    for {
        _ = m["x"]
    }
}

func errorHandler() {
    if r := recover(); r != nil {
        fmt.Println(r)
    }
}

Since map is not safe for concurrent use, the runtime reports concurrent map read and map write when it detects a concurrent read and write, and the program exits immediately.

func mapaccess1_faststr(t *maptype, h *hmap, ky string) unsafe.Pointer {
    ...
    if h.flags&hashWriting != 0 {
        throw("concurrent map read and map write")
    }
    ...
}

As above, this throw ends in a call to exit.

I used to work in Java, where hitting a concurrency problem with HashMap just threw an exception and did not crash the program.

The official explanation for this is as follows.

The runtime has added lightweight, best-effort detection of concurrent misuse of maps. As always, if one goroutine is writing to a map, no other goroutine should be reading or writing the map concurrently. If the runtime detects this condition, it prints a diagnosis and crashes the program. The best way to find out more about the problem is to run the program under the race detector, which will more reliably identify the race and give more detail.
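Since this error cannot be recovered, the practical answer is to synchronize access to the map. Here is a minimal sketch using sync.RWMutex; the counter type and countConcurrently are illustrative names, and sync.Map or a channel-based design may suit other workloads better.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical wrapper that guards a map with a sync.RWMutex.
type counter struct {
	mu sync.RWMutex
	m  map[string]int
}

// inc takes the write lock before modifying the map.
func (c *counter) inc(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key]++
}

// get takes the read lock, so several readers can proceed together.
func (c *counter) get(key string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.m[key]
}

// countConcurrently runs n writers alongside n readers and returns the final count.
func countConcurrently(n int) int {
	c := &counter{m: make(map[string]int)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); c.inc("x") }()
		go func() { defer wg.Done(); _ = c.get("x") }()
	}
	wg.Wait()
	return c.get("x")
}

func main() {
	fmt.Println(countConcurrently(100))
}
```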

Running out of stack memory

func main() {
    defer errorHandler()
    var f func(a [1000]int64)
    f = func(a [1000]int64) {
        f(a)
    }
    f([1000]int64{})
}

This example prints:

runtime: goroutine stack exceeds 1000000000-byte limit
runtime: sp=0xc0200e1be8 stack=[0xc0200e0000, 0xc0400e0000]
fatal error: stack overflow

Let me briefly explain the basic mechanics of the stack.

In Go, goroutines do not have a fixed stack size. They start small (say 4KB) and grow or shrink as needed, which gives the impression of an “infinite” stack. Growth is still finite, though: the limit comes not from a maximum call depth but from a cap on stack memory, which is 1GB on 64-bit Linux machines.

var maxstacksize uintptr = 1 << 20 // enough until runtime.main sets it for real

func newstack() {
    ...
    if newsize > maxstacksize || newsize > maxstackceiling { 
        throw("stack overflow")
    }
    ...
}

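During stack growth, newstack checks whether the new stack size exceeds maxstacksize. The initial value of 1 << 20 is only a placeholder; runtime.main replaces it with the real limit, which is the 1000000000 bytes reported in the output above on 64-bit systems. If the limit is exceeded, throw("stack overflow") is called and the process exits, so the whole program crashes.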
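The limit that newstack checks can be read and adjusted from user code through runtime/debug.SetMaxStack, which returns the previous setting. The sketch below uses a hypothetical helper, currentMaxStack, that sets a temporary value (64 MB here, chosen arbitrarily) only to read the old one and then restores it.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// currentMaxStack reads the per-goroutine stack limit. SetMaxStack returns
// the previous setting, so the value is changed and immediately restored.
func currentMaxStack() int {
	prev := debug.SetMaxStack(64 << 20) // temporary value, only used to read the old one
	debug.SetMaxStack(prev)
	return prev
}

func main() {
	fmt.Println("max stack bytes:", currentMaxStack())
}
```

Lowering the limit does not make the overflow recoverable; it only makes runaway recursion fail sooner.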

Starting a goroutine with a nil function

func main() {
    defer errorHandler()
    var f func()
    go f()
}

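This crashes immediately as well: the runtime stops the program with fatal error: go of nil func value, and the deferred recover in main never gets a chance to run.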
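Because this failure cannot be recovered, the only defence is to check the function value before the go statement. A small sketch, where startTask is a hypothetical guard and not a standard API:

```go
package main

import (
	"errors"
	"fmt"
)

// startTask is a hypothetical guard: it refuses to start a nil function
// instead of letting the go statement bring the process down.
// The returned channel is closed when f has finished.
func startTask(f func()) (<-chan struct{}, error) {
	if f == nil {
		return nil, errors.New("startTask: nil function")
	}
	done := make(chan struct{})
	go func() {
		defer close(done)
		f()
	}()
	return done, nil
}

func main() {
	var f func() // nil function value
	if _, err := startTask(f); err != nil {
		fmt.Println("not started:", err)
	}
	done, _ := startTask(func() { fmt.Println("working") })
	<-done
}
```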

All goroutines are asleep

Normally, a program never has all of its goroutines asleep at the same time; at least one goroutine is always running to handle work, e.g.

func main() {
    defer errorHandler()
    go func() {
        for true {
            fmt.Println("alive")
            time.Sleep(time.Second*1) 
        }
    }()
    <-make(chan int)
}

However, it is easy to end up with code that blocks forever, for instance by mishandling some logic and adding a statement that never returns.

func main() {
    defer errorHandler()
    go func() {
        for true {
            fmt.Println("alive")
            time.Sleep(time.Second*1)
            select {}
        }
    }()
    <-make(chan int)
}

Here the select {} inside the goroutine blocks it permanently, and main is blocked on a channel that nobody writes to. When Go detects that no goroutine is left to run, it crashes the program.

fatal error: all goroutines are asleep - deadlock!

Exceptions that can be caught

Array (slice) index out of bounds

func foo() {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println(r)
        }
    }()
    var bar = []int{1}
    fmt.Println(bar[1])
}

func main() {
    foo()
    fmt.Println("exit")
}

Output:

runtime error: index out of range [1] with length 1
exit

Because foo uses recover, the program resumes and prints exit.
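The value that recover returns for this kind of failure implements the runtime.Error interface, which makes it possible to tell runtime errors apart from application panics. A minimal sketch, assuming a hypothetical helper elementAt that converts the panic into an ordinary error:

```go
package main

import (
	"fmt"
	"runtime"
)

// elementAt is a hypothetical helper that turns an out-of-range panic into
// an ordinary error. Panics that are not runtime errors are re-raised.
func elementAt(s []int, i int) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			re, ok := r.(runtime.Error)
			if !ok {
				panic(r) // not a runtime error: pass it on
			}
			err = fmt.Errorf("elementAt: %w", re)
		}
	}()
	return s[i], nil
}

func main() {
	if _, err := elementAt([]int{1}, 1); err != nil {
		fmt.Println(err)
	}
	fmt.Println("exit")
}
```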

Nil pointer dereference

func foo() {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println(r)
        }
    }()
    var bar *int
    fmt.Println(*bar)
}

func main() {
    foo()
    fmt.Println("exit")
}

Output:

runtime error: invalid memory address or nil pointer dereference
exit

Besides the case above, another common scenario is calling a method with a pointer receiver on a variable that was initialized and later set to nil.

type Shark struct {
    Name string
}

func (s *Shark) SayHello() {
    fmt.Println("Hi! My name is", s.Name)
}

func main() {
    s := &Shark{"Sammy"}
    s = nil
    s.SayHello()
}
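Calling a method on a nil pointer is legal in itself; the panic only happens when the method dereferences the receiver. One possible defence, sketched below with an illustrative Greeting method that is not part of the original type, is to check the receiver first.

```go
package main

import "fmt"

type Shark struct {
	Name string
}

// Greeting checks the receiver before touching its fields, so calling it on
// a nil *Shark does not panic. The method name is illustrative.
func (s *Shark) Greeting() string {
	if s == nil {
		return "no shark here"
	}
	return "Hi! My name is " + s.Name
}

func main() {
	var s *Shark // nil pointer
	fmt.Println(s.Greeting())
	s = &Shark{Name: "Sammy"}
	fmt.Println(s.Greeting())
}
```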

Sending on a closed channel

func foo() {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println(r)
        }
    }()
    var bar = make(chan int, 1)
    close(bar)
    bar <- 1
}

func main() {
    foo()
    fmt.Println("exit")
}

Output:

send on closed channel
exit

func chansend(c *hchan, ep unsafe.Pointer, block bool, callerpc uintptr) bool {
    ...
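    // Acquire the lock.
    lock(&c.lock)
    // Check whether the channel is closed.
    if c.closed != 0 {
        unlock(&c.lock)
        panic(plainError("send on closed channel"))
    }
    // Dequeue a waiting receiver from recvq.
    if sg := c.recvq.dequeue(); sg != nil {
        // A receiver is waiting: hand the value to it directly, bypassing the buffer.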
        send(c, sg, ep, func() { unlock(&c.lock) }, 3)
        return true
    }
    ...
}

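When sending, the runtime checks whether the channel has been closed. Because it reports the problem with panic rather than throw, the error can be caught by recover.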
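Since the failure is an ordinary panic, it can be wrapped. The sketch below uses a hypothetical helper, trySend, that reports false instead of panicking. It still blocks if the channel is full, and arranging for only the sender to close the channel is usually the cleaner design.

```go
package main

import "fmt"

// trySend is a hypothetical wrapper that reports false instead of panicking
// when ch has already been closed.
func trySend(ch chan<- int, v int) (sent bool) {
	defer func() {
		if recover() != nil {
			sent = false
		}
	}()
	ch <- v
	return true
}

func main() {
	ch := make(chan int, 1)
	fmt.Println(trySend(ch, 1)) // true: the buffer has room
	close(ch)
	fmt.Println(trySend(ch, 2)) // false: the send panicked and was recovered
}
```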

Failed type assertion

func foo() {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println(r)
        }
    }()
    var i interface{} = "abc"
    _ = i.([]string)
}

func main() {
    foo()
    fmt.Println("exit")
}
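The single-value assertion above panics because the dynamic type of i is string, not []string. The two-value form of the assertion avoids the panic altogether. A short sketch, where asStrings is an illustrative name:

```go
package main

import "fmt"

// asStrings uses the two-value form of a type assertion, which reports
// failure through ok instead of panicking.
func asStrings(i interface{}) ([]string, bool) {
	s, ok := i.([]string)
	return s, ok
}

func main() {
	var i interface{} = "abc"
	if _, ok := asStrings(i); !ok {
		fmt.Println("i does not hold a []string")
	}
	fmt.Println("exit")
}
```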