This article will teach you how to understand Go’s plan9 assembly through stack operations.

Knowledge points

Linux process in memory layout

memory layout

Each process in a multitasking operating system runs in its own memory sandbox. In 32-bit mode, it is always 4GB of memory address space, and memory allocation is to allocate virtual memory to processes. When a process actually accesses a virtual memory address, the OS allocates a corresponding space on physical memory and then establishes a mapping relationship with it by triggering a missing page interrupt, so that the virtual memory address accessed by the process will be automatically converted into a valid physical memory address, and then data can be stored and accessed. The virtual memory address accessed by the process is automatically converted to a valid physical memory address, and data can be stored and accessed.

Kernel space : the operating system kernel address space.

Stack : The stack space is where the user stores local variables created temporarily by the program, and the stack grows from the high address to the high address down. On modern mainstream machine architectures (e.g., x86), the stack grows downward. However, there are some processors (e.g., B5000) where the stack grows upwards, some architectures (e.g., System Z) that allow custom stack growth, and even some processors (e.g., SPARC) that handle a cyclic stack.

Heap : heap space, which is used to hold memory segments that are dynamically allocated while the process is running; it is not fixed in size and can be dynamically expanded or reduced.

BBS segment : BSS segment, which holds global or static data, but holds global/static uninitialized data.

Data segment : a data segment, usually a memory area used to hold global variables that have been initialized in a program.

Text segment : A code segment, which refers to a memory area used to hold the code for program execution. The size of this segment is determined before the program runs, and the memory area is read-only.

In computer science, a call stack is a stack data structure that stores information about the active subroutines of a computer program.

In computer programming, a subroutine is a sequence of program instructions that performs a specific task, packaged as a unit.

A stack frame is a frame of data that gets pushed onto the stack. In the case of a call stack, a stack frame would represent a function call and its argument data.

With function calls, there is a caller and a callee, e.g., if function A calls function B, A is the caller and B is the callee.

The stack structure of the caller and the callee is shown in the following figure.

sobyte

The Go assembly code is very vague in explaining the stack registers, so we probably only need to know the role of two registers, BP and SP.

BP: base pointer register , which maintains the base address of the current stack frame for indexing variables and parameters, like an anchor point, which in other architectures is equivalent to the frame pointer FP, except that in the x86 architecture, variables and parameters can be indexed by SP.

SP: stack pointer register , always pointing to the top of the stack.

Goroutine stack operations

sobyte

In Goroutine there is a stack data structure with two attributes lo and hi that describe the actual stack memory address.

  • stack.lo: the low address of the stack space.
  • stack.hi: the high address of the stack space.

In Goroutine it will be determined if stack growth is to be done by stackguard0: * stack.lo: the low address of the stack space; * stack.hi: the high address of the stack space.

  • stackguard0: stack.lo + StackGuard , used for stack overlow detection.
  • StackGuard: protected area size, 928 bytes on constant Linux.
  • StackSmall: constant size 128 bytes, used for optimization of small function calls; * StackSmall: constant size 128 bytes, used for optimization of small function calls.
  • StackBig: constant size 4096 bytes.

To determine if expansion is needed based on the size of the called function stack frame.

  1. when the stack frame size (FramSzie) is less than or equal to StackSmall (128), if SP is less than stackguard0 then perform stack expansion.
  2. when the stack frame size (FramSzie) is greater than StackSmall (128), a comparison is made between SP - FramSzie + StackSmall and stackguard0 according to the formula SP - FramSzie + StackSmall, and if is less than stackguard0 then a stack expansion is performed.
  3. When the stack frame size (FramSzie) is larger than StackBig (4096), it will first check whether stackguard0 has been converted to StackPreempt state; then it will perform expansion according to the formula SP - stackguard0 + StackGuard <= framesize + (StackGuard- StackSmall) and if it is true, then perform the expansion.

It should be noted that since the stack grows from the high address to the low address, the comparison is all less than before the expansion is executed, and here you need to think about it.

When the stack expansion is executed, a larger stack memory space will be allocated in the memory space, then all the contents of the old stack will be copied to the new stack, and the pointers to the variables corresponding to the old stack will be modified to point to the new stack again, and finally the memory space of the old stack will be destroyed and recycled, thus realizing the dynamic expansion of the stack.

Assembly

Here is a brief explanation of some of the Plan 9 assemblies used by the Go language that will be used later in the analysis, in case it is not very clear.

assembly functions

Let’s start by looking at the definition of the plan9 assembly functions.

sobyte

stack frame size: contains local variables and the space for additional function calls.

arguments size: contains the size of the arguments as well as the return value, e.g. if the input is 3 int64 types and the return value is 1 int64 type, then the return value is sizeof(int64) * 4.

Stack adjustment

Stack adjustment is achieved by performing operations on the hardware SP registers, for example:

1
2
3
SUBQ    $24, SP  // 对 sp 做减法,为函数分配函数栈帧 
...
ADDQ    $24, SP  // 对 sp 做加法 ,清除函数栈帧

Since the stack grows downward, SUBQ actually allocates stack frames for the function when it subtracts from SP, and ADDQ clears the stack frames.

Common instructions

Addition and subtraction operations.

1
2
ADDQ  AX, BX   // BX += AX
SUBQ  AX, BX   // BX -= AX

Data handling.

Constants are denoted by $num in plan9 assembly, can be negative, and are decimal by default. The length of the carry is determined by the suffix of the MOV.

1
2
3
4
MOVB $1, DI      // 1 byte
MOVW $0x10, BX   // 2 bytes
MOVD $1, DX      // 4 bytes
MOVQ $-10, AX     // 8 bytes

Another difference is that when using MOVQ you will see a difference between with and without parentheses.

1
2
3
4
5
6
// 加括号代表是指针的引用
MOVQ (AX), BX   // => BX = *AX 将AX指向的内存区域8byte赋值给BX
MOVQ 16(AX), BX // => BX = *(AX + 16)

//不加括号是值的引用
MOVQ AX, BX     // => BX = AX 将AX中存储的内容赋值给BX,注意区别

jump.

1
2
3
4
5
6
7
// 无条件跳转
JMP addr   // 跳转到地址,地址可为代码中的地址
JMP label  // 跳转到标签,可以跳转到同一函数内的标签位置
JMP 2(PC)  // 以当前指令为基础,向前/后跳转 x 行

// 有条件跳转
JLS addr

Address arithmetic.

1
LEAQ (AX)(AX*2), CX // => CX = AX + (AX * 2) = AX * 3

The 2 in the above code stands for scale, scale can only be 0, 2, 4, 8.

Parsing

Creation of G

Since the stack is on the Goroutine, let’s start with the creation of G and how to initialize the stack space. Only the code for the initialization part of the stack is explained here.

G creation is done by calling runtime-newproc to create.

runtime.newproc

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
func newproc(siz int32, fn *funcval) {
    argp := add(unsafe.Pointer(&fn), sys.PtrSize)
    gp := getg()
    // 获取 caller 的 PC 寄存器
    pc := getcallerpc()
    // 切换到 G0 进行创建
    systemstack(func() {
        newg := newproc1(fn, argp, siz, gp, pc)
        ...
    })
}

The newproc method switches to G0 to call the newproc1 function for G creation.

runtime.newproc1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
const _StackMin = 2048
func newproc1(fn *funcval, argp unsafe.Pointer, narg int32, callergp *g, callerpc uintptr) *g {
    _g_ := getg()
    ...
    _p_ := _g_.m.p.ptr()
    // 从 P 的空闲链表中获取一个新的 G
    newg := gfget(_p_)
    // 获取不到则调用 malg 进行创建
    if newg == nil {
        newg = malg(_StackMin)
        casgstatus(newg, _Gidle, _Gdead)
        allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
    }
    ...
    return newg
}

The newproc1 method is very long and it mainly gets G and then does some initialization work on the obtained G. Let’s just look at the malg function call here.

When calling the malg function, a minimum stack size value is passed in: _StackMin (2048).

runtime.malg

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
func malg(stacksize int32) *g {
    // 创建 G 结构体
    newg := new(g)
    if stacksize >= 0 {
        // 这里会在 stacksize 的基础上为每个栈预留系统调用所需的内存大小 _StackSystem
        // 在 Linux/Darwin 上( _StackSystem == 0 )本行不改变 stacksize 的大小
        stacksize = round2(_StackSystem + stacksize)
        // 切换到 G0 为 newg 初始化栈内存
        systemstack(func() {
            newg.stack = stackalloc(uint32(stacksize))
        })
        // 设置 stackguard0 ,用来判断是否要进行栈扩容
        newg.stackguard0 = newg.stack.lo + _StackGuard
        newg.stackguard1 = ^uintptr(0) 
        *(*uintptr)(unsafe.Pointer(newg.stack.lo)) = 0
    }
    return newg
}

The call to malg sets aside the incoming memory size plus a _StackSystem value for use by the system call, and the round2 function rounds the incoming value to an exponent of 2. Then it switches to G0 and executes the stackalloc function to allocate the stack memory.

After allocation, stackguard0 is set to stack.lo + _StackGuard, which is used to determine if stack expansion is needed, as discussed below.

Initialization of the stack

File location: src/runtime/stack.go

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
// Global stack cache, allocating less than 32KB of memory
var stackpool [_NumStackOrders]struct {
    item stackpoolItem
    _ [cpu.CacheLinePadSize - unsafe.Sizeof(stackpoolItem{})%cpu.CacheLinePadSize]byte
}

//go:notinheap
type stackpoolItem struct {
    mu mutex
    span mSpanList
} 

// global stack cache, allocating more than 32KB of memory
var stackLarge struct {
    lock mutex
    free [heapAddrBits - pageShift]mSpanList // free lists by log_2(s.npages)
}

// initialize stackpool/stackLarge global variables
func stackinit() {
    if _StackCacheSize&_PageMask ! = 0 {
        throw("cache size must be a multiple of page size")
    }
    for i := range stackpool {
        stackpool[i].item.span.init()
        lockInit(&stackpool[i].item.mu, lockRankStackpool)
    }
    for i := range stackLarge.free {
        stackLarge.free[i].init()
        lockInit(&stackLarge.lock, lockRankStackLarge)
    }
}

Before we do the stack allocation let’s take a look at what is done during stack initialization. Note that stackinit is initialized by calling runtime-schedinit, which is done before calling runtime-newproc.

Two global variables, stackpool and stackLarge, are initialized when stack initialization is performed. stackpool can allocate less than 32KB of memory, and stackLarge is used to allocate more than 32KB of stack space.

Allocation of the stack

We also know from the initialization of the two two global variables that the stack will be allocated from different locations depending on the size.

Small stack memory allocation

File location: src/runtime/stack.go

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
func stackalloc(n uint32) stack { 
    // 这里的 G 是 G0
    thisg := getg()
    ...
    var v unsafe.Pointer
    // 在 Linux 上,_FixedStack = 2048、_NumStackOrders = 4、_StackCacheSize = 32768
    // 如果申请的栈空间小于 32KB
    if n < _FixedStack<<_NumStackOrders && n < _StackCacheSize {
        order := uint8(0)
        n2 := n
        // 大于 2048 ,那么 for 循环 将 n2 除 2,直到 n 小于等于 2048
        for n2 > _FixedStack {
            // order 表示除了多少次
            order++
            n2 >>= 1
        }
        var x gclinkptr
        //preemptoff != "", 在 GC 的时候会进行设置,表示如果在 GC 那么从 stackpool 分配
        // thisg.m.p = 0 会在系统调用和 改变 P 的个数的时候调用,如果发生,那么也从 stackpool 分配
        if stackNoCache != 0 || thisg.m.p == 0 || thisg.m.preemptoff != "" { 
            lock(&stackpool[order].item.mu)
            // 从 stackpool 分配
            x = stackpoolalloc(order)
            unlock(&stackpool[order].item.mu)
        } else {
            // 从 P 的 mcache 分配内存
            c := thisg.m.p.ptr().mcache
            x = c.stackcache[order].list
            if x.ptr() == nil {
                // 从堆上申请一片内存空间填充到stackcache中
                stackcacherefill(c, order)
                x = c.stackcache[order].list
            }
            // 移除链表的头节点
            c.stackcache[order].list = x.ptr().next
            c.stackcache[order].size -= uintptr(n)
        }
        // 获取到分配的span内存块
        v = unsafe.Pointer(x)
    } else {
        ...
    }
    ...
    return stack{uintptr(v), uintptr(v) + uintptr(n)}
}

stackalloc will allocate according to the size of the passed parameter n. On Linux, if n is less than 32768 bytes, which is 32KB, then it will enter the allocation logic of small stack.

The small stack refers to the stack size of 2K/4K/8K/16K, and during allocation, different order values will be calculated according to the size, if the stack size is 2K, then the order is 0, 4K corresponds to order 1, and so on. On the one hand, this can reduce the lock conflicts between different Goroutines getting different stack sizes, and on the other hand, the span of the corresponding size can be cached in advance for quick access.

thisg.m.p == 0 may happen when exitsyscall is called or when the number of P’s procresize is changed, thisg.m.preemptoff ! = "" will occur during GC. That is to say, when exitsyscall is called or the number of P’s is changed, or when GC occurs, stack space is allocated from stackpool, otherwise it is taken from mcache.

If the stackcache corresponding to mcache is not available, then stackcacherefill is called to request a piece of memory from the heap to fill the stackcache.

The main thing to note is that stackalloc is G0 since it switches to G0 for the call, so thisg is G0.

1
2
3
4
5
6
7
8
func stackalloc(n uint32) stack { 
    thisg := getg()
    // 添加一行打印
    if debug.schedtrace > 0 {
        print("stackalloc runtime: gp: gp=", thisg, ", goid=", thisg.goid, ", gp->atomicstatus=", readgstatus(thisg), "\n")
    }
    ...
}

Let’s look at the stackpoolalloc and stackcacherefill functions separately.

runtime.stackpoolalloc

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
func stackpoolalloc(order uint8) gclinkptr {
    list := &stackpool[order].item.span
    s := list.first
    lockWithRankMayAcquire(&mheap_.lock, lockRankMheap)
    if s == nil {
        // no free stacks. Allocate another span worth.
        // 从堆上分配 mspan
        // _StackCacheSize = 32 * 1024
        s = mheap_.allocManual(_StackCacheSize>>_PageShift, &memstats.stacks_inuse)
        if s == nil {
            throw("out of memory")
        }
        // 刚分配的 span 里面分配对象个数肯定为 0
        if s.allocCount != 0 {
            throw("bad allocCount")
        }
        if s.manualFreeList.ptr() != nil {
            throw("bad manualFreeList")
        }
        //OpenBSD 6.4+ 系统需要做额外处理
        osStackAlloc(s)
        // Linux 中 _FixedStack = 2048
        s.elemsize = _FixedStack << order
        //_StackCacheSize =  32 * 1024
        // 这里是将 32KB 大小的内存块分成了elemsize大小块,用单向链表进行连接
        // 最后 s.manualFreeList 指向的是这块内存的尾部
        for i := uintptr(0); i < _StackCacheSize; i += s.elemsize {
            x := gclinkptr(s.base() + i)
            x.ptr().next = s.manualFreeList
            s.manualFreeList = x
        }
        // 插入到 list 链表头部
        list.insert(s)
    }
    x := s.manualFreeList
    // 代表被分配完毕
    if x.ptr() == nil {
        throw("span has no free stacks")
    }
    // 将 manualFreeList 往后移动一个单位
    s.manualFreeList = x.ptr().next
    // 统计被分配的内存块
    s.allocCount++
    // 因为分配的时候第一个内存块是 nil
    // 所以当指针为nil 的时候代表被分配完毕
    // 那么需要将该对象从 list 的头节点移除
    if s.manualFreeList.ptr() == nil {
        // all stacks in s are allocated.
        list.remove(s)
    }
    return x
}

In the stackpoolalloc function will look for the stackpool corresponding to the order subscript of the head node of the span table, if not empty, then directly remove the node pointed to by the attribute manualFreeList of the head node from the linkedlist and return; if list.first is empty, then call mheap_’s allocManual function to allocate mspan from the heap.

If list.first is empty, then the allocManual function of mheap_ is called to allocate mspan from the heap.

The allocManual function allocates a 32KB block of memory, and after allocating the new span, it cuts the 32KB memory according to the size of elemsize, and then strings it together in a one-way chain and assigns the last memory address to manualFreeList.

For example, the current elemsize represents a memory size of 8KB.

sobyte

runtime.stackcacherefill

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
func stackcacherefill(c *mcache, order uint8) { 
    var list gclinkptr
    var size uintptr
    lock(&stackpool[order].item.mu)
    //_StackCacheSize = 32 * 1024
    // 将 stackpool 分配的内存组成一个单向链表 list
    for size < _StackCacheSize/2 {
        x := stackpoolalloc(order)
        x.ptr().next = list
        list = x
        // _FixedStack = 2048
        size += _FixedStack << order
    }
    unlock(&stackpool[order].item.mu)
    c.stackcache[order].list = list
    c.stackcache[order].size = size
}

The stackcacherefill function will call stackpoolalloc to get half of the space from the stackpool and assemble it into a list chain, then put it into the stackcache array.

Large Stack Memory Allocation

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
func stackalloc(n uint32) stack { 
    thisg := getg() 
    var v unsafe.Pointer

    if n < _FixedStack<<_NumStackOrders && n < _StackCacheSize {
        ...
    } else {
        // 申请的内存空间过大,从 runtime.stackLarge 中检查是否有剩余的空间
        var s *mspan
        // 计算需要分配多少个 span 页, 8KB 为一页
        npage := uintptr(n) >> _PageShift
        // 计算 npage 能够被2整除几次,用来作为不同大小内存的块的索引
        log2npage := stacklog2(npage)

        lock(&stackLarge.lock)
        // 如果 stackLarge 对应的链表不为空
        if !stackLarge.free[log2npage].isEmpty() {
            //获取链表的头节点,并将其从链表中移除
            s = stackLarge.free[log2npage].first
            stackLarge.free[log2npage].remove(s)
        }
        unlock(&stackLarge.lock)

        lockWithRankMayAcquire(&mheap_.lock, lockRankMheap)
        //这里是stackLarge为空的情况
        if s == nil {
            // 从堆上申请新的内存 span
            s = mheap_.allocManual(npage, &memstats.stacks_inuse)
            if s == nil {
                throw("out of memory")
            }
            // OpenBSD 6.4+ 系统需要做额外处理
            osStackAlloc(s)
            s.elemsize = uintptr(n)
        }
        v = unsafe.Pointer(s.base())
    }
    ...
    return stack{uintptr(v), uintptr(v) + uintptr(n)}
}

For large stack memory allocation, the runtime will check if there is any space left in stackLarge, and if there is no space left, it will also call mheap_.allocManual to request new memory from the heap.

Stack expansion

Stack overflow detection

The compiler will execute: src/cmd/internal/obj/x86/obj6.go:stacksplit at the time of target code generation to check if the current goroutine has enough stack space by inserting the appropriate instructions based on the function stack frame size.

  1. when the stack frame size (FramSzie) is less than or equal to StackSmall (128), if SP is less than stackguard0 then perform stack expansion.
  2. when the stack frame size (FramSzie) is larger than StackSmall (128), the stack is expanded according to the formula SP - FramSzie + StackSmall compared to stackguard0, if is smaller than stackguard0 then the expansion is performed.
  3. When the stack frame size (FramSzie) is larger than StackBig (4096), it will first check whether stackguard0 has been converted to StackPreempt state; then it will perform expansion according to the formula SP - stackguard0 + StackGuard <= framesize + (StackGuard- StackSmall) and if it is true, then perform the expansion.

Let’s take a look at the pseudo-code which will make it a bit clearer.

When the stack frame size (FramSzie) is less than or equal to StackSmall (128).

1
2
CMPQ SP, stackguard
JEQ label-of-call-to-morestack

When the stack frame size (FramSzie) is larger than StackSmall (128).

1
2
3
LEAQ -xxx(SP), AX 
CMPQ AX, stackguard
JEQ label-of-call-to-morestack

where AX = SP - framesize + StackSmall and then the CMPQ instruction is executed to make AX compare with stackguard.

When the stack frame size (FramSzie) is larger than StackBig (4096).

1
2
3
4
5
6
MOVQ    stackguard, SI // SI = stackguard
CMPQ    SI, $StackPreempt // compare SI ,StackPreempt
JEQ label-of-call-to-morestack
LEAQ    StackGuard(SP), AX // AX = SP + StackGuard
SUBQ    SI, AX // AX = AX - SI =  SP + StackGuard -stackguard
CMPQ    AX, $(framesize+(StackGuard-StackSmall))

The pseudo-code here will be relatively complicated, as the stackguard0 inside G may be assigned to StackPreempt when it is preempted, so it is clear whether it is preempted or not, then it is necessary to compare stackguard0 with StackPreempt. Then the comparison will be performed: SP-stackguard+StackGuard <= framesize + (StackGuard-StackSmall), with StackGuard added on both sides to ensure that the value on the left is positive.

I hope you don’t read further until you understand the above code.

The main thing to note is that in some functions, the compiler intelligently adds the NOSPLIT flag, which disables stack overflow detection after hitting this flag, which can be found in the following code.

Code location: cmd/internal/obj/x86/obj6.go

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
...
    if ctxt.Arch.Family == sys.AMD64 && autoffset < objabi.StackSmall && !p.From.Sym.NoSplit() {
        leaf := true
    LeafSearch:
        for q := p; q != nil; q = q.Link {
            ...
        }

        if leaf {
            p.From.Sym.Set(obj.AttrNoSplit, true)
        }
    }
...

The general code logic should be: when the function is at a leaf node of the call chain and the stack frame is smaller than StackSmall bytes, it is automatically marked as NOSPLIT. similarly, we can write the code ourselves by adding //go:nosplit to the function to force the NOSPLIT attribute.

Stack overflow example

Here we write a simple example.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
func main() {
    a, b := 1, 2
    _ = add1(a, b)
    _ = add2(a, b)
    _ = add3(a, b)
}

func add1(x, y int) int {
    _ = make([]byte, 20)
    return x + y
}

func add2(x, y int) int {
    _ = make([]byte, 200)
    return x + y
}

func add3(x, y int) int {
    _ = make([]byte, 5000)
    return x + y
}

Then print out its assembly:

1
$ GOOS=linux GOARCH=amd64 go tool compile -S -N -l main.go

The example above explains the three cases described above in three method calls.

main function

1
2
3
4
5
6
7
8
        0x0000 00000 (main.go:3)        TEXT    "".main(SB), ABIInternal, $48-0
        0x0000 00000 (main.go:3)        MOVQ    (TLS), CX 
        0x0009 00009 (main.go:3)        CMPQ    SP, 16(CX) // SP < stackguard 则跳到 129执行
        0x0009 00009 (main.go:3)        CMPQ    SP, 16(CX)
        0x000d 00013 (main.go:3)        PCDATA  $0, $-2
        0x000d 00013 (main.go:3)        JLS     129
        ... 
        0x0081 00129 (main.go:3)        CALL    runtime.morestack_noctxt(SB)

First, we load a value from the TLS (thread local storage) variable into the CX register, and then compare the SP with 16(CX), so what is TLS and what does 16(CX) stand for?

Actually, TLS is a pseudo-register that represents the thread-local storage, which holds the G structure. Let’s take a look at the definition of G in the runtime source code.

1
2
3
4
5
6
7
8
9
type g struct { 
    stack       stack   // offset known to runtime/cgo
    stackguard0 uintptr // offset known to liblink
    ...
}
type stack struct {
    lo uintptr
    hi uintptr
}

You can see that the stack takes up 16bytes, so 16(CX) corresponds to g.stackguard0. So the CMPQ SP, 16(CX) line of code actually compares the SP and stackguard sizes. If the SP is smaller than the stackguard, then the growth threshold is reached and JLS will be executed to jump to line 129 and call runtime.morestack_noctxt to perform the next stack expansion operation.

add1

1
        0x0000 00000 (main.go:10)       TEXT    "".add1(SB), NOSPLIT|ABIInternal, $32-24

When we look at the add1 assembly function, we can see that its stack size is only 32, which does not reach the StackSmall 128 bytes size, and it is a callee caller, so we can send it with the NOSPLIT flag, which confirms my conclusion above.

add2

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
"".add2 STEXT size=148 args=0x18 locals=0xd0
        0x0000 00000 (main.go:15)       TEXT    "".add2(SB), ABIInternal, $208-24
        0x0000 00000 (main.go:15)       MOVQ    (TLS), CX
        // AX = SP  - 208 + 128 = SP -80
        0x0009 00009 (main.go:15)       LEAQ    -80(SP), AX // 栈大小大于StackSmall =128, 计算 SP - FramSzie + StackSmall 并放入AX寄存器                         
        0x000e 00014 (main.go:15)       CMPQ    AX, 16(CX) // AX < stackguard 则跳到 138 执行
        0x0012 00018 (main.go:15)       PCDATA  $0, $-2
        0x0012 00018 (main.go:15)       JLS     138
        ...
        0x008a 00138 (main.go:15)       CALL    runtime.morestack_noctxt(SB)

The stack frame size of add2 function is 208, which is larger than StackSmall 128 bytes, so you can see that it first loads a value from the TLS variable to the CX register.

Then execute the instruction LEAQ -80(SP), AX, but the reason why it is -80 actually made me quite confused at that time, but it should be noted that the formula here is: SP - FramSzie + StackSmall, after directly substituting it, you will find that it is -80, and then load this value into the AX register.

Finally call CMPQ AX, 16(CX) , 16(CX) we have already talked about above is equal to stackguard0, so here is a comparison between AX and stackguard0 of small and large, if less than then directly jump to line 138 to execute runtime.morestack_noctxt.

add3

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
"".add3 STEXT size=157 args=0x18 locals=0x1390
        0x0000 00000 (main.go:20)       TEXT    "".add3(SB), ABIInternal, $5008-24
        0x0000 00000 (main.go:20)       MOVQ    (TLS), CX 
        0x0009 00009 (main.go:20)       MOVQ    16(CX), SI // 将 stackguard 赋值给  SI
        0x000d 00013 (main.go:20)       PCDATA  $0, $-2 
        0x000d 00013 (main.go:20)       CMPQ    SI, $-1314 // 将 stackguard < stackPreempt 则跳转到 147 执行
        0x0014 00020 (main.go:20)       JEQ     147 
        0x0016 00022 (main.go:20)       LEAQ    928(SP), AX // AX = SP +928
        0x001e 00030 (main.go:20)       SUBQ    SI, AX // AX -= stackguard
        0x0021 00033 (main.go:20)       CMPQ    AX, $5808 // framesize + 928 -128  = 5808,比较 AX < 5808,则执行147
        0x0027 00039 (main.go:20)       JLS     147
        ...
        0x0093 00147 (main.go:20)       CALL    runtime.morestack_noctxt(SB)

The add3 function directly allocates an array of 5000 bytes on the stack, so it starts the same way, loading a value from the TLS variable into the CX register and then assigning stackguard0 to the SI register.

Next, the instruction CMPQ SI, $-1314 will be executed, which actually compares the size of stackguard0 and StackPreempt, and the reason why it is -1314 is that the StackPreempt variable is called directly when the assembly code is inserted, and this variable is written inside the code.

Code location: cmd/internal/objabi/stack.go

1
2
3
const (
    StackPreempt = -1314 // 0xfff...fade
)

If it is not preempted, then go straight down to LEAQ 928(SP), AX, which is equivalent to AX = SP +_StackGuard, and _StackGuard is equivalent to 928 in Linux.

Next execute SUBQ SI, AX, which is equal to AX -= stackguard0.

Finally CMPQ AX, $5808, the 5808 is actually framesize + _StackGuard - _StackSmall, if AX is less than 5808 then jump to line 147 and execute the runtime.morestack_noctxt function.

This is the end of stack overflow detection, I’ve read other articles and I don’t think any of them are as comprehensive as mine, especially when the stack frame size is larger than _StackBig.

Stack expansion

runtime.morestack_noctxt is implemented in assembly and it calls to runtime-morestack, let’s see its implementation as follows.

Code location: src/runtime/asm_amd64.s

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
TEXT runtime·morestack(SB),NOSPLIT,$0-0
    // Cannot grow scheduler stack (m->g0).
    // 无法增长调度器的栈(m->g0)
    get_tls(CX)
    MOVQ    g(CX), BX
    MOVQ    g_m(BX), BX
    MOVQ    m_g0(BX), SI
    CMPQ    g(CX), SI
    JNE 3(PC)
    CALL    runtime·badmorestackg0(SB)
    CALL    runtime·abort(SB)
    // 省略signal stack、morebuf和sched的处理
    ...
    // Call newstack on m->g0's stack.
    // 在 m->g0 栈上调用 newstack.
    MOVQ    m_g0(BX), BX
    MOVQ    BX, g(CX)
    MOVQ    (g_sched+gobuf_sp)(BX), SP
    CALL    runtime·newstack(SB)
    CALL    runtime·abort(SB)   // 如果 newstack 返回则崩溃 crash if newstack returns
    RET

After runtime-morestack does the checksum assignment, it switches to G0 and calls runtime-newstack to finish the expansion operation.

runtime-newstack

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
func newstack() {
    thisg := getg() 

    gp := thisg.m.curg

    // 初始化寄存器相关变量
    morebuf := thisg.m.morebuf
    thisg.m.morebuf.pc = 0
    thisg.m.morebuf.lr = 0
    thisg.m.morebuf.sp = 0
    thisg.m.morebuf.g = 0
    ...
    // 校验是否被抢占
    preempt := atomic.Loaduintptr(&gp.stackguard0) == stackPreempt

    // 如果被抢占
    if preempt {
        // 校验是否可以安全的被抢占
        // 如果 M 上有锁
        // 如果正在进行内存分配
        // 如果明确禁止抢占
        // 如果 P 的状态不是 running
        // 那么就不执行抢占了
        if !canPreemptM(thisg.m) {
            // 到这里表示不能被抢占?
            // Let the goroutine keep running for now.
            // gp->preempt is set, so it will be preempted next time.
            gp.stackguard0 = gp.stack.lo + _StackGuard
            // 触发调度器的调度
            gogo(&gp.sched) // never return
        }
    }

    if gp.stack.lo == 0 {
        throw("missing stack in newstack")
    }
    // 寄存器 sp
    sp := gp.sched.sp
    if sys.ArchFamily == sys.AMD64 || sys.ArchFamily == sys.I386 || sys.ArchFamily == sys.WASM {
        // The call to morestack cost a word.
        sp -= sys.PtrSize
    } 
    ...
    if preempt {
        //需要收缩栈
        if gp.preemptShrink { 
            gp.preemptShrink = false
            shrinkstack(gp)
        }
        // 被 runtime.suspendG 函数挂起
        if gp.preemptStop {
            // 被动让出当前处理器的控制权
            preemptPark(gp) // never returns
        }

        //主动让出当前处理器的控制权
        gopreempt_m(gp) // never return
    }

    // 计算新的栈空间是原来的两倍
    oldsize := gp.stack.hi - gp.stack.lo
    newsize := oldsize * 2 
    ... 
    //将 Goroutine 切换至 _Gcopystack 状态
    casgstatus(gp, _Grunning, _Gcopystack)

    //开始栈拷贝
    copystack(gp, newsize) 
    casgstatus(gp, _Gcopystack, _Grunning)
    gogo(&gp.sched)
}

The first half of the newstack function takes on the task of preempting the Goroutine.

The new stack is calculated to be twice the size of the original before starting the stack copy, and then the Goroutine state is switched to the _Gcopystack state.

Stack Copy

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
func copystack(gp *g, newsize uintptr) { 
    old := gp.stack 
    // 当前已使用的栈空间大小
    used := old.hi - gp.sched.sp

    //分配新的栈空间
    new := stackalloc(uint32(newsize))
    ...

    // 计算调整的幅度
    var adjinfo adjustinfo
    adjinfo.old = old
    // 新栈和旧栈的幅度来控制指针的移动
    adjinfo.delta = new.hi - old.hi

    // 调整 sudogs, 必要时与 channel 操作同步
    ncopy := used
    if !gp.activeStackChans {
        ...
        adjustsudogs(gp, &adjinfo)
    } else {
        // 到这里代表有被阻塞的 G 在当前 G 的channel 中,所以要防止并发操作,需要获取 channel 的锁

        // 在所有 sudog 中找到地址最大的指针
        adjinfo.sghi = findsghi(gp, old) 
        // 对所有 sudog 关联的 channel 上锁,然后调整指针,并且复制 sudog 指向的部分旧栈的数据到新的栈上
        ncopy -= syncadjustsudogs(gp, used, &adjinfo)
    } 
    // 将源栈中的整片内存拷贝到新的栈中
    memmove(unsafe.Pointer(new.hi-ncopy), unsafe.Pointer(old.hi-ncopy), ncopy)
    // 继续调整栈中 txt、defer、panic 位置的指针
    adjustctxt(gp, &adjinfo)
    adjustdefers(gp, &adjinfo)
    adjustpanics(gp, &adjinfo)
    if adjinfo.sghi != 0 {
        adjinfo.sghi += adjinfo.delta
    } 
    // 将 G 上的栈引用切换成新栈
    gp.stack = new
    gp.stackguard0 = new.lo + _StackGuard // NOTE: might clobber a preempt request
    gp.sched.sp = new.hi - used
    gp.stktopsp += adjinfo.delta

    // 在新栈重调整指针
    gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, adjustframe, noescape(unsafe.Pointer(&adjinfo)), 0)

    if stackPoisonCopy != 0 {
        fillstack(old, 0xfc)
    }
    //释放原始栈的内存空间
    stackfree(old)
}
  1. copystack first calculates the size of the used stack space, then only the used space is copied when the stack copy is made.

  2. then call the stackalloc function to allocate a block of memory from the heap.

  3. then compare the value of hi of the old and new stack to calculate the difference between the two blocks of memory delta, this delta will be called when adjustsudogs, adjustctxt and other functions to determine the location of the old stack memory pointer, then add delta and then get the location of the new stack pointer, so that you can also adjust the pointer to the new stack.

    sobyte

  4. call memmove to copy the entire slice of memory from the source stack to the new stack.

  5. then continue to call the function that adjusts the pointers to continue to adjust the pointers to the txt, defer, and panic locations on the stack.

  6. next switch the stack reference on G to the new stack.

  7. finally call stackfree to free the memory space of the original stack.

Stack shrinkage

Stack shrinkage occurs during the phase when the stack is scanned during GC.

1
2
3
4
5
6
func scanstack(gp *g, gcw *gcWork) {
    ... 
    // 进行栈收缩
    shrinkstack(gp)
    ...
}

runtime.shrinkstack

shrinkstack This function I blocked some checksum functions, leaving only the core logic of the surface.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
func shrinkstack(gp *g) {
    ...
    oldsize := gp.stack.hi - gp.stack.lo
    newsize := oldsize / 2 
    // 当收缩后的大小小于最小的栈的大小时,不再进行收缩
    if newsize < _FixedStack {
        return
    }
    avail := gp.stack.hi - gp.stack.lo
    // 计算当前正在使用的栈数量,如果 gp 使用的当前栈少于四分之一,则对栈进行收缩
    // 当前使用的栈包括到 SP 的所有内容以及栈保护空间,以确保有 nosplit 功能的空间
    if used := gp.stack.hi - gp.sched.sp + _StackLimit; used >= avail/4 {
        return
    }
    // 将旧栈拷贝到新收缩后的栈上
    copystack(gp, newsize)
}

The new stack is reduced to half its original size, and if it is smaller than _FixedStack (2KB) then it is not shrunk. It also calculates if the current stack is used by less than 1/4, and does not shrink if it is used by more than 1/4.

Finally, if it is determined that the stack is to be shrunk, the copystack function is called to perform the stack copy logic.

Summary

If you don’t know the memory layout, it may be difficult to understand it, because when we look at the heap, the memory grows from small to large, while the stack grows in the opposite direction, which causes the stack instruction to decrease the SP instead of increasing the stack frame.

In addition, Go uses plan9 assembly, which is a little bit less informative and seems to be a lot of trouble.