Many systems have daemons that monitor the running state of the system in the background and respond to unexpected situations as they arise. The system monitor is an important part of the Go runtime: it inspects the runtime at regular intervals to make sure the program has not entered an abnormal state. This section describes the design and implementation of Go's system monitor, including its startup, its execution loop, and its main responsibilities.

Design Principles

In operating systems that support multitasking, daemons are programs that run in the background rather than under a user's direct control; they typically start automatically when the operating system boots. Both Kubernetes' DaemonSet and the Go language's system monitor follow this design to provide general background functionality.

Daemons are a very effective design: they live for the entire lifetime of the system, starting when it starts and ending when it ends. In operating systems and in Kubernetes, we often run database services, logging services, and monitoring services as daemons.

The Go language's system monitor plays a similarly important role. It starts an internal loop that never exits; within that loop it polls the network, preempts Goroutines that have been running for a long time or are blocked in system calls, and triggers garbage collection, keeping the system in a healthy state.

Monitor Loop

When a Go program starts, the runtime calls runtime.main in the first Goroutine to start the main program, which creates a new thread on the system stack:

func main() {
	...
	if GOARCH != "wasm" {
		systemstack(func() {
			newm(sysmon, nil)
		})
	}
	...
}

runtime.newm creates a new runtime.m structure that stores the function to run and the processor to attach. The runtime executes the system monitor without a processor, so the system monitor's Goroutine runs directly on the newly created thread:

func newm(fn func(), _p_ *p) {
	mp := allocm(_p_, fn)
	mp.nextp.set(_p_)
	mp.sigmask = initSigmask
	...
	newm1(mp)
}

runtime.newm1 calls the platform-specific runtime.newosproc, which creates a new thread via the clone system call and executes runtime.mstart on it:

func newosproc(mp *m) {
	stk := unsafe.Pointer(mp.g0.stack.hi)
	var oset sigset
	sigprocmask(_SIG_SETMASK, &sigset_all, &oset)
	ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
	sigprocmask(_SIG_SETMASK, &oset, nil)
	...
}

The newly created thread executes the runtime.sysmon stored in runtime.m, starting system monitoring:

func sysmon() {
	sched.nmsys++
	checkdead()

	lasttrace := int64(0)
	idle := 0
	delay := uint32(0)
	for {
		if idle == 0 {
			delay = 20
		} else if idle > 50 {
			delay *= 2
		}
		if delay > 10*1000 {
			delay = 10 * 1000
		}
		usleep(delay)
		...
	}
}

When this function is first called, it checks for deadlocks via runtime.checkdead and then enters the core monitoring loop. At the beginning of each iteration, the system monitor suspends the current thread via usleep; the function's argument is in microseconds, and the runtime determines the sleep time according to the following rules:

  • The initial sleep time is 20μs;
  • The maximum sleep time is 10ms;
  • When the system monitor has not woken up any Goroutines for 50 cycles, the sleep time doubles each cycle.
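The back-off rules above can be sketched as a standalone function (a simplified model of the loop in runtime.sysmon, not the runtime's actual code):

```go
package main

import "fmt"

// sysmonDelay applies one iteration of sysmon's back-off rule: reset
// to 20μs whenever the monitor just did useful work (idle == 0),
// double the delay once it has been idle for more than 50 cycles,
// and cap the result at 10ms. All values are in microseconds.
func sysmonDelay(delay uint32, idle int) uint32 {
	if idle == 0 {
		delay = 20
	} else if idle > 50 {
		delay *= 2
	}
	if delay > 10*1000 {
		delay = 10 * 1000
	}
	return delay
}

func main() {
	delay := uint32(0)
	for idle := 0; idle <= 60; idle++ { // 60 consecutive idle cycles
		delay = sysmonDelay(delay, idle)
	}
	fmt.Println(delay) // 10000: the delay has hit the 10ms cap
}
```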

Once the program stabilizes, the system monitor's sleep interval settles at 10ms. Besides checking for deadlocks, it also performs the following tasks in the loop:

  • Running timers - discovering the next timer that needs to be triggered;
  • Polling the network - fetching the file descriptors that are ready to be processed;
  • Preempting processors - preempting Goroutines that have been running for too long or are blocked in system calls;
  • Garbage collection - triggering garbage collection to reclaim memory when the conditions are met.

The rest of this section describes, in turn, how the system monitor accomplishes each of these tasks.

Checking for deadlocks

The system monitor checks for deadlocks at runtime with runtime.checkdead. We can break the deadlock check into the following three steps:

  1. Checking for the existence of running threads;
  2. Checking for the existence of running Goroutines;
  3. Checking for pending timers on the processors.

This function first checks the number of running threads in the Go runtime, a value computed from several fields in the scheduler:

func checkdead() {
	var run0 int32
	run := mcount() - sched.nmidle - sched.nmidlelocked - sched.nmsys
	if run > run0 {
		return
	}
	if run < 0 {
		print("runtime: checkdead: nmidle=", sched.nmidle, " nmidlelocked=", sched.nmidlelocked, " mcount=", mcount(), " nmsys=", sched.nmsys, "\n")
		throw("checkdead: inconsistent counts")
	}
	...
}
  1. runtime.mcount computes the number of existing threads from the next thread ID to be created and the number of released threads;
  2. nmidle is the number of idle threads;
  3. nmidlelocked is the number of idle threads that are locked;
  4. nmsys is the number of system threads, such as the monitor thread itself.

Using the thread counts above, we can compute the number of running threads. If it is greater than 0, the current program cannot be deadlocked; if it is less than 0, the program's state is inconsistent; if it is exactly 0, we need to check the program's run state further.
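Plugging in hypothetical counts (the real values come from the scheduler) makes the arithmetic concrete:

```go
package main

import "fmt"

// runningThreads mirrors checkdead's computation:
// run = mcount() - nmidle - nmidlelocked - nmsys.
func runningThreads(mcount, nmidle, nmidlelocked, nmsys int32) int32 {
	return mcount - nmidle - nmidlelocked - nmsys
}

func main() {
	// 6 threads exist, 3 are idle, 1 is idle and locked, and 2 are
	// system threads: no thread is running user code, so checkdead
	// must keep looking for a deadlock.
	fmt.Println(runningThreads(6, 3, 1, 2)) // 0
}
```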

func checkdead() {
	...
	grunning := 0
	lock(&allglock)
	for i := 0; i < len(allgs); i++ {
		gp := allgs[i]
		if isSystemGoroutine(gp, false) {
			continue
		}
		s := readgstatus(gp)
		switch s &^ _Gscan {
		case _Gwaiting, _Gpreempted:
			grunning++
		case _Grunnable, _Grunning, _Gsyscall:
			print("runtime: checkdead: find g ", gp.goid, " in status ", s, "\n")
			throw("checkdead: runnable g")
		}
	}
	unlock(&allglock)
	if grunning == 0 {
		throw("no goroutines (main called runtime.Goexit) - deadlock!")
	}
	...
}
  1. When there are Goroutines in the _Grunnable, _Grunning, or _Gsyscall state, the program has deadlocked;
  2. When all Goroutines are in the _Gidle, _Gdead, or _Gcopystack state, the main Goroutine has called runtime.Goexit.

When the runtime has waiting Goroutines but no running ones, we check the timers present on the processors:

func checkdead() {
	...
	for _, _p_ := range allp {
		if len(_p_.timers) > 0 {
			return
		}
	}

	throw("all goroutines are asleep - deadlock!")
}

If any processor holds pending timers, it is legitimate for all Goroutines to be asleep; but if there are none, the runtime simply reports an error and exits the program.

Running timers

In the system monitor loop, we use runtime.nanotime and runtime.timeSleepUntil to get the current time and the time the next timer needs to fire. When the scheduler needs to perform garbage collection or all processors are idle, the system monitor can go to sleep temporarily if no timer is about to fire:

func sysmon() {
	...
	for {
		...
		now := nanotime()
		next, _ := timeSleepUntil()
		if debug.schedtrace <= 0 && (sched.gcwaiting != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs)) {
			lock(&sched.lock)
			if atomic.Load(&sched.gcwaiting) != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs) {
				if next > now {
					atomic.Store(&sched.sysmonwait, 1)
					unlock(&sched.lock)
					sleep := forcegcperiod / 2
					if next-now < sleep {
						sleep = next - now
					}
					...
					notetsleep(&sched.sysmonnote, sleep)
					...
					now = nanotime()
					next, _ = timeSleepUntil()
					lock(&sched.lock)
					atomic.Store(&sched.sysmonwait, 0)
					noteclear(&sched.sysmonnote)
				}
				idle = 0
				delay = 20
			}
			unlock(&sched.lock)
		}
		...
		if next < now {
			startm(nil, false)
		}
	}
}

The sleep duration is determined by the forced-GC period forcegcperiod and the time the next timer fires. runtime.notetsleep uses a semaphore to put the system monitor to sleep. When the system monitor is woken up, we recompute the current time and the next timer to fire, call runtime.noteclear to reset the note, and restore the sleep interval.
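The bound on this sleep can be expressed as a small helper (a sketch; the runtime computes it inline, and forcegcperiod is 2 minutes):

```go
package main

import "fmt"

// sysmonSleep returns how long sysmon may sleep: at most half the
// forced-GC period, but never past the next timer. All values are in
// nanoseconds; callers guarantee next > now.
func sysmonSleep(now, next, forcegcperiod int64) int64 {
	sleep := forcegcperiod / 2
	if next-now < sleep {
		sleep = next - now
	}
	return sleep
}

func main() {
	const forcegcperiod = 2 * 60 * 1000 * 1000 * 1000 // 2 minutes in ns
	// A timer due in 5s wakes the monitor early; with no timer due for
	// 10 minutes, the forced-GC period caps the sleep at 1 minute.
	fmt.Println(sysmonSleep(0, 5_000_000_000, forcegcperiod))   // 5000000000
	fmt.Println(sysmonSleep(0, 600_000_000_000, forcegcperiod)) // 60000000000
}
```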

If we then find that the next timer's trigger time is earlier than the current time, it probably means all threads are busy running Goroutines; the system monitor starts a new thread to trigger the timers and avoid large deviations in their expiration times.

Polling the network

If more than 10ms have passed since the last network poll, the system monitor also polls the network in the loop, checking for file descriptors that are ready to be processed.
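This 10ms gate amounts to the following predicate (a sketch with names of our own choosing; times are in nanoseconds):

```go
package main

import "fmt"

// shouldNetpoll reports whether sysmon should poll the network: the
// netpoller must be initialized, a poll must have happened before
// (lastpoll != 0), and more than 10ms must have passed since it.
func shouldNetpoll(inited bool, lastpoll, now int64) bool {
	const tenMS = 10 * 1000 * 1000 // 10ms in nanoseconds
	return inited && lastpoll != 0 && lastpoll+tenMS < now
}

func main() {
	fmt.Println(shouldNetpoll(true, 1, 20*1000*1000)) // true: over 10ms elapsed
	fmt.Println(shouldNetpoll(true, 1, 5*1000*1000))  // false: too soon
}
```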

func sysmon() {
	...
	for {
		...
		lastpoll := int64(atomic.Load64(&sched.lastpoll))
		if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
			atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now))
			list := netpoll(0)
			if !list.empty() {
				incidlelocked(-1)
				injectglist(&list)
				incidlelocked(1)
			}
		}
		...
	}
}

The function above calls runtime.netpoll non-blockingly to check for ready file descriptors, and adds all ready Goroutines to the global run queue via runtime.injectglist:

func injectglist(glist *gList) {
	if glist.empty() {
		return
	}
	lock(&sched.lock)
	var n int
	for n = 0; !glist.empty(); n++ {
		gp := glist.pop()
		casgstatus(gp, _Gwaiting, _Grunnable)
		globrunqput(gp)
	}
	unlock(&sched.lock)
	for ; n != 0 && sched.npidle != 0; n-- {
		startm(nil, false)
	}
	*glist = gList{}
}

This function switches all the Goroutines from _Gwaiting to _Grunnable and adds them to the global run queue to await execution; if the program has idle processors, it also starts threads via runtime.startm to execute those tasks.

Preempting processors

The system monitor calls runtime.retake in the loop to preempt processors that are running or in a system call. The function iterates over the runtime's global processors, each of which holds a runtime.sysmontick:

type sysmontick struct {
	schedtick   uint32
	schedwhen   int64
	syscalltick uint32
	syscallwhen int64
}

The four fields of this structure store the processor's schedule count, the time of its last schedule, its system-call count, and the time of its last system call. The loop in runtime.retake contains two different kinds of preemption logic:

func retake(now int64) uint32 {
	n := 0
	for i := 0; i < len(allp); i++ {
		_p_ := allp[i]
		pd := &_p_.sysmontick
		s := _p_.status
		if s == _Prunning || s == _Psyscall {
			t := int64(_p_.schedtick)
			if int64(pd.schedtick) != t {
				pd.schedtick = uint32(t)
				pd.schedwhen = now
			} else if pd.schedwhen+forcePreemptNS <= now {
				preemptone(_p_)
			}
		}

		if s == _Psyscall {
			if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now {
				continue
			}
			if atomic.Cas(&_p_.status, s, _Pidle) {
				n++
				_p_.syscalltick++
				handoffp(_p_)
			}
		}
	}
	return uint32(n)
}
  1. When the processor is in the _Prunning or _Psyscall state, we preempt it with runtime.preemptone if more than 10ms (forcePreemptNS) have passed since the last scheduling;
  2. When the processor is in the _Psyscall state, runtime.handoffp is called to give up the processor when either of the following conditions holds:
    1. The processor's run queue is not empty, or no idle processors or spinning threads exist;
    2. The system call has lasted longer than 10ms.

By preempting processors in the loop, the system monitor avoids the starvation that would result from one Goroutine occupying a thread for too long.
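The effect is observable from user code. In the sketch below, a Goroutine spins in a tight loop on a single processor; before the preemption machinery (sysmon's retake plus, since Go 1.14, signal-based asynchronous preemption), this could monopolize the only thread forever, but today the sender still gets to run:

```go
package main

import (
	"fmt"
	"runtime"
)

// raceWithBusyLoop pins the scheduler to one processor, starts a
// Goroutine that spins without ever yielding, and returns the value a
// second Goroutine sends. The spinning Goroutine only lets go of the
// processor because the runtime preempts it (Go 1.14+).
func raceWithBusyLoop() string {
	defer runtime.GOMAXPROCS(runtime.GOMAXPROCS(1)) // restore on return
	done := make(chan string)
	go func() { done <- "preempted" }()
	go func() {
		for { // tight loop: no function calls, no allocation, no yields
		}
	}()
	return <-done
}

func main() {
	fmt.Println(raceWithBusyLoop())
}
```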

Garbage collection

Finally, the system monitor determines whether a forced garbage collection needs to be triggered: runtime.sysmon constructs a runtime.gcTrigger and calls its runtime.gcTrigger.test method to decide whether garbage collection is needed:

func sysmon() {
	...
	for {
		...
		if t := (gcTrigger{kind: gcTriggerTime, now: now}); t.test() && atomic.Load(&forcegc.idle) != 0 {
			lock(&forcegc.lock)
			forcegc.idle = 0
			var list gList
			list.push(forcegc.g)
			injectglist(&list)
			unlock(&forcegc.lock)
		}
		...
	}
}

If garbage collection needs to be triggered, we add the Goroutine responsible for forced garbage collection to the global queue, letting the scheduler pick an appropriate processor to run it.
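For the gcTriggerTime kind, gcTrigger.test boils down to a time comparison (a sketch; in the runtime, forcegcperiod is 2 minutes and the last collection time comes from memstats):

```go
package main

import "fmt"

// timeTriggered mirrors gcTrigger.test for gcTriggerTime: a forced GC
// is due once forcegcperiod has elapsed since the previous collection.
// All values are in nanoseconds; lastgc == 0 means GC never ran.
func timeTriggered(lastgc, now, forcegcperiod int64) bool {
	return lastgc != 0 && now-lastgc > forcegcperiod
}

func main() {
	const forcegcperiod = 2 * 60 * 1000 * 1000 * 1000 // 2 minutes in ns
	fmt.Println(timeTriggered(0, 3*forcegcperiod, forcegcperiod)) // false: no previous GC
	fmt.Println(timeTriggered(1, forcegcperiod/2, forcegcperiod)) // false: too recent
	fmt.Println(timeTriggered(1, forcegcperiod+2, forcegcperiod)) // true: period elapsed
}
```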

Summary

The runtime's system monitor triggers thread preemption, network polling, and garbage collection to keep the Go runtime available. It mitigates tail-latency problems, reduces Goroutine starvation in the scheduler, and ensures timers fire as close to their deadlines as possible.