Francisco J. Ballesteros
TR LSUB-15-2 draft
Abstract
This TR contains notes about the Go run time as of Go 1.4. They are intended to aid in the construction of a kernel for Clive.1. Initialization
rt0_darwin_amd64.s:13 main() is the
entry point for the binary. A direct jump to...
asm_amd64.s:9: runtime·rt0_go which is
the actual crt0. This initializes the stack calls
_cgo_init if needed, updates stack guards,
setups TLS, sets m to
m0 and g to
g0 and continues calling...
runtime·argsto save the arguments from the OSos_darwin.c:54: runtime·osinitto initialize threading (they don't use their libc)proc.c:122: runtime·schedinit(void)to create a G that callsruntime·main.traceback.go:48: func tracebackinit()symtab.go:46: func symtabinitstack.c:42: runtime·stackinit(void)calls this function for the stacks in the pool:mheap.c:663: runtime·MSpanList_Init
malloc.c:109: runtime·mallocinitmheap.c:57: runtime·MHeap_Initinitializes the heap using these:mfixalloc.c:16: runtime·FixAlloc_Init, for several allocs.mheap.c:663: runtime·MSpanList_Init, for free and busy lists.mcentral.c:21: runtime·MCentral_Init, for mcentral lists.
mcache.c:19: runtime·allocmcachesetsg->m->mcache, a per-P malloc cache for small objects.
proc.c:190: mcommoninitcalls os-specificmpreinitand links m into the sched list (allm).goargsandgoenvssave the UNIX arguments.mgc0.c:1345: runtime·gcinitsets GC masks for data and bss.proc.c:2655: procresizeis called to create Ps forGOMAXPROCS.
proc.c:2154: runtime·newprocis called to runruntime·mainusingasm_amd64.s:218: runtime·onMto callnewproc_m)to runruntime·main. This switches to g0, if not on m stack (i.e., no switch this time).proc.c:2128: newproc_mis called by g0 and copies the args and then calls:proc.c:2182: runtime·newproc1. This creates a G to run a function (runtime·main). Using:proc.c:2295: gfgetto get a G from the free list, orproc.c:2102: runtime·malgto allocate a G, which ends up doing anew(g)in go.
stack.c:806: runtime·gostartcallfn, similar to set-label.
proc.c:3284: runqputto put the new G on a local runq or the global one.proc.c:1276: wakep, which is a nop or a call to sched the M for P or make a new M for P.proc.c:1200: startm
proc.c:818: runtime·mstartstarts the M. It calls:asminitto do per-thread initialization (nop for our arch).os_darwin.c:144: minitinitializes signal handling for our arch.proc.c:1537: scheduleruns the scheduller and calls this when gets something to run:proc.c:1358: executeis mostly a call to a goto-label:asm_amd64.s:143: gogo
proc.go:16: runtime·main(), which calls
runtime_initmain_initmain_mainexit(0)
runtime·init function in
proc.go spawns the goroutine to call
gogc(1) in a loop.
2. Labels
AGobuf is similar to a
Label:
struct Gobuf
{
uintptr sp; // the actual label.
uintptr pc; // the actual label.
G* g; // the actual label.
void* ctxt; // aux context pointer.
uintreg ret; // return value
uintptr lr; // link register used in ARM.
};
The main operations are:
runtime·gosave, i.e., set label. Clearsctxtandret.runtime·gogo, i.e., goto label. SetsctxtinDX, used byruntime·morestackand returnsretafter the actual jump to the label.
pc, sp, and
g are the actual context saved and restored.
That is the label.
The field ctxt is used to point to a frame in
the stack and seems to be the user context in other places. Seems to
be an auxiliary pointer but it is not clear (yet) how it is used. The
field lr seems to be the link register for
the ARM and is not used by our architecture.
3. Sleep/Wakeup
These are giant locks:
void runtime·stoptheworld(void);
void runtime·starttheworld(void);
extern uint32 runtime·worldsema;
used for example to dump all the stacks.
These are the usual locks with a user-level fast path:
void runtime·lock(Mutex*);
void runtime·unlock(Mutex*);
These are sleep/wakeup like structures, and timed out variants for them with the usual conventions of at most one calling sleep and at most one calling wakeup for it. Plus, there is a clear operation to reset the thing for a further use:
void runtime·noteclear(Note*);
void runtime·notesleep(Note*);
void runtime·notewakeup(Note*);
bool runtime·notetsleep(Note*, int64); // false - timeout
bool runtime·notetsleepg(Note*, int64); // false - timeout
And these are futexes or semaphores (eg., on Darwin) to implement the ones above:
uintptr runtime·semacreate(void);
int32 runtime·semasleep(int64);
void runtime·semawakeup(M*);
In many cases lock-free data structures are used; or at least parts of them are handled lock-free by using atomic and CAS-like operations, eg. to change statuses of processes and to link them to a list.
4. Scheduling
Most of the interesting code is inproc.c.
There are three central structures:
- G
-
A goroutine.
grefers to the current andg0is the idle process. This is a user-level thread.gis a register on the ARM and a slot in TLS everywhere else. - M
-
A machine, actually a UNIX
process.
mrefers to the current one. This is a kernel-level thread that runs Gs or is idle or is in a syscall. - P
-
A processor, actually a per
GOMAXPROCSsched.
The interesting bits of G are:
struct G
{
Stack stack; // [stack.lo, stack.hi).
uintptr stackguard0; // stack.lo+StackGuard or StackPreempt
uintptr stackguard1; // stack.lo+StackGuard on g0 or ~0 on others
Panic* panic; // innermost panic - offset known to liblink
Defer* defer; // innermost defer
Gobuf sched;
uintptr syscallsp; // if status==Gsyscall, syscallsp = sched.sp to use during gc
uintptr syscallpc; // if status==Gsyscall, syscallpc = sched.pc to use during gc
void* param; // passed parameter on wakeup
int64 goid;
G* schedlink;
bool preempt; // preemption signal, dup of stackguard0 = StackPreempt
M* m; // for debuggers, but offset not hard-coded
M* lockedm;
SudoG* waiting; // sudog structures this G is waiting on
...
};
The interesting bits of M are:
struct M
{
G* g0; // goroutine with scheduling stack
G* gsignal; // signal-handling G
uintptr tls[4]; // thread-local storage (for x86 extern register)
void (*mstartfn)(void);
G* curg; // current running goroutine
G* caughtsig; // goroutine running during fatal signal
P* p; // attached P for executing Go code (nil if not executing Go code)
int32 mallocing,throwing,gcing;
int32 locks;
bool spinning; // M is out of work and is actively looking for work
bool blocked; // M is blocked on a Note
Note park;
M* alllink; // on allm
M* schedlink;
MCache* mcache;
G* lockedg;
uint32 locked; // tracking for LockOSThread
M* nextwaitm; // next M waiting for lock
uintptr waitsema; // semaphore for parking on locks
uint32 waitsemacount;
uint32 waitsemalock;
bool (*waitunlockf)(G*, void*);
void* waitlock;
uintptr scalararg[4]; // scalar argument/return for mcall
void* ptrarg[4]; // pointer argument/return for mcall
...
};
A value of m->locks greater than zero
prevents preemption and also prevents garbage colloection.
The interesting bits of P are:
struct P
{
Mutex lock;
uint32 status; // Pidle, Prunning, Psyscall, Pgcstop, Pdead
P* link;
uint32 schedtick; // incremented on every scheduler call
uint32 syscalltick; // incremented on every system call
M* m; // back-link to associated M (nil if idle)
MCache* mcache;
Defer* deferpool[5]; // pool of available Defers
// Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
uint64 goidcache;
uint64 goidcacheend;
// Queue of runnable goroutines.
uint32 runqhead;
uint32 runqtail;
G* runq[256];
// Available G's (status == Gdead)
G* gfree;
int32 gfreecnt;
};
This is a local scheduler plus cached structures per UNIX process used to run Go code. And then there is a global scheduler structure that also caches some structures.
struct SchedT
{
Mutex lock;
uint64 goidgen;
M* midle; // idle m's waiting for work
P* pidle; // idle P's
G* runqhead; // Global runnable queue.
Mutex gflock; // global cache go dead Gs
G* gfree;
uint32 gcwaiting; // gc is waiting to run
int32 stopwait;
Note stopnote;
...
};
There is a single runtime·sched, and
runtime·sched.lock protects it all.
4.1. Process creation
Code likego fn() is translated as a call to
proc.c:2154: runtime·newproc as shown in the
first section for initialization. This records the function and a
pointer to the ... arguments yn
g->m->ptrarg[] and then calls
newproc_m using
runtime·onM.
- There's a switch to
M'sg0 newproc_mruns there.- And there's a switch back when done.
newproc_m takes the function and
arguments from m->ptrarg and calls
runtime·newproc1:
- This adds to
m->locksto avoid preemption and takes P fromm->p. - A G is taken from the cache at
P or allocated (with
StackMinstack bytes).
Gdead. Then arguments are copied into
G's stack and it's prepared:
- It's label
pc,sp, andgare set to the function pointer and the new stack andg, and then adjusted to pretend that such function did callgosave, so we cangogo(gotolabel) it. In this case, the labelctxtis set to theFuncValvalue going. The return value is set to callgoexit+PCQuantumwhich is a call togoexit1. - Its state is changed to
Grunnable(cas op.) - A new
idis taken from the id cache atP(perhaps refilling the cache).
runqput(p,
newg). This uses atomic load/store to put
newg at
p->runq[tail], using
p->runqtail as a increasing counter for the
tail that is copied to tail and truncated as
an index for runq. This happens if
p->runq is not full.
If p->runq is full,
runqputslow is called to take half of the
local queue at P and place them at the global
runtime·sched.runqhead/tail queue while
holding the global scheduler lock. if the CAS operation to operate on
the local queue fails, the whole runqput is
retried.
4.2. Process termination
Performed by a call toproc.c:1682: runtime·goexit1, that does anmcallto rungoexit0atg0.goexit0(or any other mcall) never returns and callsgogoto sched to somewhere else, usually tog->schedto let it run later. The G state is set toGdead. Then these are called:proc.c:1598: dropg, to releasem->curg->mandm->curg(gis nowg0).proc.c:2259: gfput, to put the dying G atp->gfree, linked throughg->schedlink. If there are too many free Gs cached, they are moved to the global list atruntime·sched.gfree.schedule, runs one scheduler loop and jumps to a new G.
asm_amd64.s:2237: runtime·goexit, which callsruntime·goexit1.
4.3. Context switches
Calls toschedule can be found at:
proc.c:661: mquiesce, used to run code atg0and then schedule back a G.proc.c:868: mstart, to jump to a newg0for a new M.proc.c:1655: runtime·park_m, used to make G wait for something.proc.c:1675: runtime·gosched_m, anmcallfromGoschedatproc.go.proc.c:1718: goexit0, to rung0or other Gs when exiting.proc.c:2022: exitsyscall0, called after a system call atg0.malloc.go:482: gogccallsGoschedwhen the GC is done.mgc0.go:87: bgsweepcallsGosched.
g->stackguard0 is
StackPreempt, then
asm_amd64.s:824:runtime·stackcheck calls
morestack and preemts if the goroutine seems
to be running user code (not runtime code), by calling
gosched_m like
Gosched does. I have not checked out if the
compiler inserts calls in loops that do not perform function calls
(which check the stack), and those might never be preempted; which
does not matter much for us now becase they would be bugs in the code
and not the normal case. The stack checks are inserted silently by the
linker unless NOSPLIT is indicated, and thus
normal function calls are a source of possible preemptions.
As a side-efect, a NOSPLIT function call is
not preemptible by this mechanism.
The code in schedule runs the scheduler:
- There can be no
m->locks, or it's a bug. - If
m->lockedgthenproc.c:1287: stoplockedmis called, to stop the current G at M until G can run again. This is done by lendingm->pto another M, if we have a P usinghandoffp(releasep()), and then callingruntime·notesleep(&g->m->park)to sleep. Lateracquirep(m->nextp)is used to get a new P before returning and...proc.c:1358: executeis called form->lockedg.
- If
runtime·sched.gcwaitingthengcstopmis called, which callsreleasepandstopm; This happens whenstoptheworldis called and, once we are doneschedulerestarts again.
schedule picks a ready
G. Once in a while from the global pool, and
usually from m->p.
proc.c:3333: runqgetpicks one G from the queue usingp->runqheadand CAS.- When there is no G ready to run,
proc.c:1381: findrunnabletries to steal callingglobrunqgetand,runtime·netpoll(!!). - Then if there's no work,
m->spinningis set and it tries to steal from others. - When everything fails, it calls
stopmand blocks.
- When there is no G ready to run,
proc.c:1358: executeis called, which is mostly a call to a goto-label:asm_amd64.s:143: gogo
5. System calls
All system calls callasm_darwin_amd64.s:19 syscall.Syscall
or Syscall6 that
- calls
proc.c:1761 runtime·reentersyscall(fromruntime·entersyscall)-
to prepare for a system call
- This increments
m->locksto avoid preemption, and tricksg->stackguard0to make sure the no split-stack happens. - Sets G status to
Gsyscall - Saves the PC and SP in
g->sched. - Releases
p->mandm->mcache - Sets P status to
Psyscall - Calls
entersyscall_gcwaitwhenruntime·sched.gcwaiting - Decrements
m->locksbefore proceding further.
- This increments
- moves parameters into registers and executes
- executes
SYSCALL, and, upon return from the system call, calls - calls
proc.c:1893 runtime·exitsyscallcalls directlyexitsyscallfastto either re-acquire the last P or or get another idle P.- or runs
exitsyscall0as anmcallto run the scheduler and then return to continue when the scheduler returns and G can run again.- This calls
pidlegetand thenacquirepandexecuteorstopmandschedule, depending.
- This calls
- and returns to the caller.
Two other functions, syscall·RawSyscall and
·RawSyscall6 are wrappers that issue the
system call without doing anything: no calls to
runtime·entersyscall and no calls to
runtime·exitsyscall. I don't know why they
are not called from the entry points above.
6. Signals
There is a globalruntime·sigtab[sig] table
with entries of this data type:
struct SigTab
{
int32 flags;
int8 *name;
};
enum
{
SigNotify = 1<<0, // let signal.Notify have signal, even if from kernel
SigKill = 1<<1, // if signal.Notify doesn't take it, exit quietly
SigThrow = 1<<2, // if signal.Notify doesn't take it, exit loudly
SigPanic = 1<<3, // if the signal is from the kernel, panic
SigDefault = 1<<4, // if the signal isn't explicitly requested, don't monitor it
SigHandling = 1<<5, // our signal handler is registered
SigIgnored = 1<<6, // the signal was ignored before we registered for it
SigGoExit = 1<<7, // cause all runtime procs to exit (only used on Plan 9).
};
Signals are handled in their own signal stack. When a new M is initialized,
runtime·minitis called, eg., for Darwin, it callssys_darwin_amd64.s:265: runtime·sigaltstackto set the signal stack context, andsys_darwin_386.s:218: runtime·sigprocmaskto set the signal mask to none.
dropm is called to release an
M,
runtime·unminitis called and it callsruntime·signalstackto unset the signal stack.
signal package
init function:
- calls
runtime·signal_enable, which when called for the first time setssig.inuseand clears thesig.notenote, to prepare for handling signals. - loops calling
process(syscall.Signal(signal_recv())). Here,sig.s:21 signal_recvis just a jump tosigqueue.go:91 runtime·signal_recv, which returns a signal number when it's posted by the OS.
signal.Notify is called to install a
handler:
signal.go:49 Notifyrecords in ahandlerstable for the signal given or for all the signals, and callssignal_unix.go:47 signal·enableSignal, which is just a call tosigqueue.go:128 runtime·signal_enable. This sets the signal insig.wantedand callssigenable_gowhich is just a call, usingonMto this:signal.c:8 runtime·sigenable_menables a signal calling with the signal number kept atm->scalararg[0]. It is just a call to:signal_unix.c:44 runtime·sigenableenables a signal, making a call to the next withruntime·sighandleras the handler function.os_darwin.c:519: runtime·setsigcallsruntime·sigactionspecifyingruntime·sigtrampas the handler and supplying the handler function (runtime·sighandler) in the sigaction structure.
sys_darwin_386.s:240 runtime·sigtramp- saves
gand setsm->gsignalas the current G, and then calls the actual handler:signal_amd64x.c:44 runtime·sighandlerlooks into the globalruntime·sigtaband might panic if not handled or deliver the signal by calling the next.sigqueue.go:48 runtime·sigsendis called with a signal number to send the signal to receivers. The signal is queued in a global if there is no receiver ready but the signal is wanted and is not already in the queue.
- saves
singal·init function would send the signal to
the users channel with a non-blocking send. That is, if there is not
enough buffering in the channel signals are lost; but that's ok.
7. Memory allocation
Memory is allocated by calls tonew. The
implementation for this builtin is:
malloc.go:348 newobject, which calls tomallocgcusingflagNoScanas a flag if the type is marked as having no pointers.malloc.go:46 mallocgcis the main entry point for memory allocation.
malloc.go:357
newarray, which does the same. Raw memory is allocated by
calls to rawmem which also calls
mallocgc with flags
flagNoScan and
flagNoZero; i.e., that's almost a direct
malloc call, but for the GC.
There are several allocation classes. Sizes are recorded in
runtime·class_to_size[]. Alignments are:
- 8, for sizes under 16 bytes and sizes that are not a power of 2.
- 16, for sizes under 128 bytes,
- size/8, for sizes under 2KiB,
- 256, for sizes starting at 2KiB.
- Sizes under
MaxSmallSize(32KiB), use the larger size that keeps the same number of objects in each page. They are allocated from a per-P cache. - Sizes starting at 32KiB are rounded to
PageSize. They are allocated from the global heap.
- Objects with no pointers and less than 16 bytes
(
maxTinySize) are allocated within a single allocation and share it. This is done in the P cache. The allocation is released when all such objects are collected. This is done in the P cache. - Objects up to 1024-8 bytes go into a size class rounded to multiples of 8 bytes.
- Objects up to 32KiB are rounded to multiples of 128 bytes. This is done in the P cache.
- Objects larger than 32KiB are allocated on the heap by calling
largeAlloc_monM.
- The tiny allocations use the
tinyallocation from theMCachestructure atp->mcache. There is one per P and it requires no locks:struct MCache { byte* tiny; uintptr tinysize; // The rest is not accessed on every malloc. MSpan* alloc[NumSizeClasses]; // spans to allocate from StackFreeList stackcache[NumStackOrders]; SudoG* sudogcache; ... };When
tinyis exhausted, a new one is taken fromMcache.alloc, and, if that is exhausted, a new one is allocated by a call toruntime·MCache_Refill, usingonM. This asksruntime·MCentral_CacheSpanto allocate a new span and places it intoMache.alloc. - The allocations with pointers or starting at 16 bytes, try first
to use
p->cache.allocto get an allocation of the right size. If this fails,MCache_Refillis called as before,onM. - Large allocations starting at 32KiB call
largeAlloc_monM.malloc.c:372: runtime·largeAlloc_mrounds the size to a multiple ofPageSizeand calls:mheap.c:231: runtime·MHeap_Alloccallsmheap_allocdirectly if it's a call made byg0, ormcalls it.mheap.c:171 mheap_allocallocates n memory pages.
mgc0.c:1815: runtime·markspanis called to tell the GC.
flagNoScan, the GC
bitmap is updated by looking at the type given to
mallocgc to record which words are pointers.
Also, the unused part at the end of the actual allocation is marked in
the GC bitmap as dead.
Looking at the MHeap now, for allocations
under 1Mbyte (for 4KiB pages), there is a list of allocation with
exactly that page size. For larger allocations there is a final large
list used.