Breaks windows and race detector.
TBR=rsc
««« original CL description
runtime: stack allocator, separate from mallocgc
In order to move malloc to Go, we need to have a
separate stack allocator. If we run out of stack
during malloc, malloc will not be available
to allocate a new stack.
Stacks are the last remaining FlagNoGC objects in the
GC heap. Once they are out, we can get rid of the
distinction between the allocated/blockboundary bits.
(This will be in a separate change.)
Fixes #7468
Fixes #7424
LGTM=rsc, dvyukov
R=golang-codereviews, dvyukov, khr, dave, rsc
CC=golang-codereviews
https://golang.org/cl/104200047
»»»
TBR=rsc
CC=golang-codereviews
https://golang.org/cl/101570044
In order to move malloc to Go, we need to have a
separate stack allocator. If we run out of stack
during malloc, malloc will not be available
to allocate a new stack.
Stacks are the last remaining FlagNoGC objects in the
GC heap. Once they are out, we can get rid of the
distinction between the allocated/blockboundary bits.
(This will be in a separate change.)
Fixes #7468
Fixes #7424
LGTM=rsc, dvyukov
R=golang-codereviews, dvyukov, khr, dave, rsc
CC=golang-codereviews
https://golang.org/cl/104200047
The runtime has historically held two dedicated values g (current goroutine)
and m (current thread) in 'extern register' slots (TLS on x86, real registers
backed by TLS on ARM).
This CL removes the extern register m; code now uses g->m.
On ARM, this frees up the register that formerly held m (R9).
This is important for NaCl, because NaCl ARM code cannot use R9 at all.
The Go 1 macrobenchmarks (those with per-op times >= 10 µs) are unaffected:
BenchmarkBinaryTree17 5491374955 5471024381 -0.37%
BenchmarkFannkuch11 4357101311 4275174828 -1.88%
BenchmarkGobDecode 11029957 11364184 +3.03%
BenchmarkGobEncode 6852205 6784822 -0.98%
BenchmarkGzip 650795967 650152275 -0.10%
BenchmarkGunzip 140962363 141041670 +0.06%
BenchmarkHTTPClientServer 71581 73081 +2.10%
BenchmarkJSONEncode 31928079 31913356 -0.05%
BenchmarkJSONDecode 117470065 113689916 -3.22%
BenchmarkMandelbrot200 6008923 5998712 -0.17%
BenchmarkGoParse 6310917 6327487 +0.26%
BenchmarkRegexpMatchMedium_1K 114568 114763 +0.17%
BenchmarkRegexpMatchHard_1K 168977 169244 +0.16%
BenchmarkRevcomp 935294971 914060918 -2.27%
BenchmarkTemplate 145917123 148186096 +1.55%
Minux previously reported larger variations, but those were caused by
run-to-run noise, not repeatable slowdowns.
Actual code changes by Minux.
I only did the docs and the benchmarking.
LGTM=dvyukov, iant, minux
R=minux, josharian, iant, dave, bradfitz, dvyukov
CC=golang-codereviews
https://golang.org/cl/109050043
The 'continuation pc' is where the frame will continue
execution, if anywhere. For a frame that stopped execution
due to a CALL instruction, the continuation pc is immediately
after the CALL. But for a frame that stopped execution due to
a fault, the continuation pc is the pc after the most recent CALL
to deferproc in that frame, or else 0. That is where execution
will continue, if anywhere.
The liveness information is only recorded for CALL instructions.
This change makes sure that we never look for liveness information
except for CALL instructions.
Using a valid PC fixes crashes when a garbage collection or
stack copying tries to process a stack frame that has faulted.
Record continuation pc in heapdump (format change).
Fixes #8048.
LGTM=iant, khr
R=khr, iant, dvyukov
CC=golang-codereviews, r
https://golang.org/cl/100870044
mstats.last_gc is currently Unix time, but it is compared with abstract monotonic time.
On my machine GC is forced every 5 mins regardless of last_gc.
LGTM=rsc
R=golang-codereviews
CC=golang-codereviews, iant, rsc
https://golang.org/cl/91350045
When I did the original 386 ports on Linux and OS X, I chose to
define GS-relative expressions like 4(GS) as relative to the actual
thread-local storage base, which was usually GS but might not be
(it might be FS, or it might be a different constant offset from GS or FS).
The original scope was limited but since then the rewrites have
gotten out of control. Sometimes GS is rewritten, sometimes FS.
Some ports do other rewrites to enable shared libraries and
other linking. At no point in the code is it clear whether you are
looking at the real GS/FS or some synthesized thing that will be
rewritten. The code manipulating all these is duplicated in many
places.
The first step to fixing issue 7719 is to make the code intelligible
again.
This CL adds an explicit TLS pseudo-register to the 386 and amd64 assemblers.
As a register, TLS refers to the thread-local storage base, and it
can only be loaded into another register:
MOVQ TLS, AX
An offset from the thread-local storage base is written off(reg)(TLS*1).
Semantically it is off(reg), but the (TLS*1) annotation marks this as
indexing from the loaded TLS base. This emits a relocation so that
if the linker needs to adjust the offset, it can. For example:
MOVQ TLS, AX
MOVQ 8(AX)(TLS*1), CX // load m into CX
On systems that support direct access to the TLS memory, this
pair of instructions can be reduced to a direct TLS memory reference:
MOVQ 8(TLS), CX // load m into CX
The 2-instruction and 1-instruction forms correspond roughly to
ELF TLS initial exec mode and ELF TLS local exec mode, respectively.
Liblink applies this rewrite on systems that support the 1-instruction form.
The decision is made using only the operating system (and probably
the -shared flag, eventually), not the link mode. If some link modes
on a particular operating system require the 2-instruction form,
then all builds for that operating system will use the 2-instruction
form, so that the link mode decision can be delayed to link time.
Obviously it is late to be making changes like this, but I despair
of correcting issue 7719 and issue 7164 without it. To make sure
I am not changing existing behavior, I built a "hello world" program
for every GOOS/GOARCH combination we have and then worked
to make sure that the rewrite generates exactly the same binaries,
byte for byte. There are a handful of TODOs in the code marking
kludges to get the byte-for-byte property, but at least now I can
explain exactly how each binary is handled.
The targets I tested this way are:
darwin-386
darwin-amd64
dragonfly-386
dragonfly-amd64
freebsd-386
freebsd-amd64
freebsd-arm
linux-386
linux-amd64
linux-arm
nacl-386
nacl-amd64p32
netbsd-386
netbsd-amd64
openbsd-386
openbsd-amd64
plan9-386
plan9-amd64
solaris-amd64
windows-386
windows-amd64
There were four exceptions to the byte-for-byte goal:
windows-386 and windows-amd64 have a time stamp
at bytes 137 and 138 of the header.
darwin-386 and plan9-386 have five or six modified
bytes in the middle of the Go symbol table, caused by
editing comments in runtime/sys_{darwin,plan9}_386.s.
Fixes #7164.
LGTM=iant
R=iant, aram, minux.ma, dave
CC=golang-codereviews
https://golang.org/cl/87920043
Given
	type Outer struct {
		*Inner
		...
	}
the compiler generates the implementation of (*Outer).M dispatching to
the embedded Inner. The implementation is logically:
	func (p *Outer) M() {
		(p.Inner).M()
	}
but since the only change here is the replacement of one pointer
receiver with another, the actual generated code overwrites the
original receiver with the p.Inner pointer and then jumps to the M
method expecting the *Inner receiver.
During reflect.Value.Call, we create an argument frame and the
associated data structures to describe it to the garbage collector,
populate the frame, call reflect.call to run a function call using
that frame, and then copy the results back out of the frame. The
reflect.call function does a memmove of the frame structure onto the
stack (to set up the inputs), runs the call, and then memmoves the
stack back to the frame structure (to preserve the outputs).
Originally reflect.call did not distinguish inputs from outputs: both
memmoves were for the full stack frame. However, in the case where the
called function was one of these wrappers, the rewritten receiver is
almost certainly a different type than the original receiver. This is
not a problem on the stack, where we use the program counter to
determine the type information and understand that during (*Outer).M
the receiver is an *Outer while during (*Inner).M the receiver in the
same memory word is now an *Inner. But in the statically typed
argument frame created by reflect, the receiver is always an *Outer.
Copying the modified receiver pointer off the stack into the frame
will store an *Inner there, and then if a garbage collection happens
to scan that argument frame before it is discarded, it will scan the
*Inner memory as if it were an *Outer. If the two have different
memory layouts, the collection will interpret the memory incorrectly.
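To make the hazard concrete, here is a hedged illustration (the types are hypothetical; what matters is that the two receivers have different pointer layouts):

	type Inner struct{ x [4]uintptr } // scalar words only

	func (p *Inner) M() {}

	type Outer struct {
		*Inner
		ptrs [4]*byte // pointer words where Inner has scalars
	}

	// reflect.ValueOf(&Outer{Inner: new(Inner)}).Method(0).Call(nil)
	// The wrapper rewrites the receiver slot, so the frame word typed
	// *Outer now holds an *Inner; a GC scanning that frame would read
	// Inner's scalar words as if they were Outer's pointer words.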
Fix by only copying back the results.
Fixes #7725.
LGTM=khr
R=khr
CC=dave, golang-codereviews
https://golang.org/cl/85180043
The software floating point code runs with m->locks incremented
to avoid being preempted; recognize this case in panic
and undo the increment so that m->locks is maintained correctly
when panicking.
Fixes #7553.
LGTM=dvyukov
R=golang-codereviews, dvyukov
CC=golang-codereviews
https://golang.org/cl/84030043
This is the same check we use during stack copying.
The check cannot be applied to C stack frames, even
though we do emit pointer bitmaps for the arguments,
because (1) the pointer bitmaps assume all arguments
are always live, which is not true of outputs during the prologue,
and (2) the pointer bitmaps encode interface values as
pointer pairs, which is not true of interfaces holding integers.
For the rest of the frames, however, we should hold ourselves
to the rule that a pointer marked live really is initialized.
The interface scanning already implicitly checks this
because it interprets the type word as a valid type pointer.
This may slow things down a little because of the extra loads.
Or it may speed things up because we don't bother enqueuing
nil pointers anymore. Enough of the rest of the system is slow
right now that we can't measure it meaningfully.
Enable for now, even if it is slow, to shake out bugs in the
liveness bitmaps, and then decide whether to turn it off
for the Go 1.3 release (issue 7650 reminds us to do this).
The new m->traceback field lets us force printing of fp=
values on all goroutine stack traces when we detect a
bad pointer. This makes it easier to understand exactly
where in the frame the bad pointer is, so that we can trace
it back to a specific variable and determine what is wrong.
Update #7650
LGTM=khr
R=khr
CC=golang-codereviews
https://golang.org/cl/80860044
Change two-bit stack map entries to encode:
0 = dead
1 = scalar
2 = pointer
3 = multiword
If multiword, the two-bit entry for the following word encodes:
0 = string
1 = slice
2 = iface
3 = eface
That way, during stack scanning we can check if a string
is zero length or a slice has zero capacity. We can avoid
following the contained pointer in those cases. It is safe
to do so because it can never be dereferenced, and it is
desirable to do so because it may cause false retention
of the following block in memory.
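Written out as constants, the encoding is roughly (the names follow the runtime's Bits* convention; treat the exact identifiers as illustrative):

	const (
		BitsDead      = 0
		BitsScalar    = 1
		BitsPointer   = 2
		BitsMultiWord = 3
	)
	// if BitsMultiWord, the next two-bit entry is one of:
	const (
		BitsString = 0
		BitsSlice  = 1
		BitsIface  = 2
		BitsEface  = 3
	)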
Slice feature turned off until issue 7564 is fixed.
Update #7549
LGTM=rsc
R=golang-codereviews, bradfitz, rsc
CC=golang-codereviews
https://golang.org/cl/76380043
Structured Exception Handling (SEH) was the first way to handle
exceptions (memory faults, divides by zero) on Windows.
The S might as well stand for "stack-based": the implementation
interprets stack addresses in a few different ways, and it gets
subtly confused by Go's management of stacks. It's also something
that requires active maintenance during cgo switches, and we've
had bugs in that maintenance in the past.
We have recently come to believe that SEH cannot work with
Go's stack usage. See http://golang.org/issue/7325 for details.
Vectored Exception Handling (VEH) is more like a Unix signal
handler: you set it once for the whole process and forget about it.
This CL drops all the SEH code and replaces it with VEH code.
Many special cases and 7 #ifdefs disappear.
VEH was introduced in Windows XP, so Go on windows/386 will
now require Windows XP or later. The previous requirement was
Windows 2000 or later. Windows 2000 immediately preceded
Windows XP, so Windows 2000 is the only affected version.
Microsoft stopped supporting Windows 2000 in 2010.
See http://golang.org/s/win2000-golang-nuts for details.
Fixes #7325.
LGTM=alex.brainman, r
R=golang-codereviews, alex.brainman, stephen.gutekanst, dave
CC=golang-codereviews, iant, r
https://golang.org/cl/74790043
The Solaris network poller uses event ports, which are
level-triggered. As such, it has to re-arm itself after each
wakeup. The arming mechanism (which runs in its own thread) raced
with the closing of a file descriptor happening in a different
thread. When a network file descriptor is about to be closed,
the network poller is awakened to give it a chance to remove its
association with the file descriptor. Because the poller always
re-armed itself, it raced with code that closed the descriptor.
This change makes the network poller check before re-arming if
the file descriptor is about to be closed, in which case it will
ignore the re-arming request. It uses the per-PollDesc lock in
order to serialize access to the PollDesc.
This change also adds extensive documentation describing the
Solaris implementation of the network poller.
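In Go-flavored pseudocode the guard looks roughly like this (the real implementation is C in netpoll_solaris.c; the names, including portAssociate, are illustrative stand-ins for the event-port calls):

	// caller holds the per-PollDesc lock
	func netpollrearm(pd *pollDesc) {
		if pd.closing {
			return // fd is being closed; ignore the re-arming request
		}
		// event ports are level-triggered: each wakeup removes the
		// association, so restore it before the next port_getn wait
		portAssociate(portfd, pd.fd, pd.events)
	}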
Fixes #7410.
LGTM=dvyukov, iant
R=golang-codereviews, bradfitz, iant, dvyukov, aram.h, gobot
CC=golang-codereviews
https://golang.org/cl/69190044
This is especially important for SetPanicOnFault,
but also useful for e.g. nil deref in mallocgc.
Panics on such crashes can't lead to anything useful,
only to deadlocks, hangs and obscure crashes.
This is a copy of the broken but already LGTMed
https://golang.org/cl/68540043/
TBR=rsc
R=rsc
CC=golang-codereviews
https://golang.org/cl/75320043
Thanks to Ian for spotting these.
runtime.h: define uintreg correctly.
stack.c: address warning caused by the type of uintreg being 32 bits on amd64p32.
Commentary (mainly for my own use)
nacl/amd64p32 defines a machine with 64-bit registers, but the address space is limited to a 4GB window (the window is placed randomly inside the full 48-bit virtual address space of a process). To cope with this, 6c defines _64BIT and _64BITREG.
_64BITREG is always defined by 6c, so both GOARCH=amd64 and GOARCH=amd64p32 use 64-bit-wide registers.
However _64BIT itself is only defined when 6c is compiling for amd64 targets. The definition is elided for amd64p32 environments, causing int, uint, and other arch-specific types to revert to their 32-bit definitions.
LGTM=iant
R=iant, rsc, remyoudompheng
CC=golang-codereviews
https://golang.org/cl/72860046
There are at least 3 bugs:
1. g->stacksize accounting is broken during copystack/shrinkstack
2. stktop->free is not properly maintained during copystack/shrinkstack
3. stktop->free logic is broken:
we can have stktop->free==FixedStack,
and we will free it into the stack cache,
but it actually comes from the heap as the result of a non-copying segment shrink.
This shows up as (at least) spurious races on the race builders, and possibly other failures as well.
The idea behind the refactoring is to consolidate stacksize and
segment origin logic in stackalloc/stackfree.
Fixes #7490.
LGTM=rsc, khr
R=golang-codereviews, rsc, khr
CC=golang-codereviews
https://golang.org/cl/72440043
Recursive panics leave dangling Panic structs in the g->panic stack.
At best this leads to a Defer leak and incorrect output on a subsequent panic.
At worst it arbitrarily corrupts the heap.
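An illustrative trigger is a deferred function that panics while an earlier panic is still being processed:

	func main() {
		defer func() {
			panic("second") // raised while "first" is still unwinding
		}()
		panic("first")
	}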
LGTM=rsc
R=rsc
CC=golang-codereviews
https://golang.org/cl/72480043
Implement custom assembly thunks for hot race calls (memory accesses and function entry/exit).
The thunks extract the caller pc, verify that the address is in the heap or globals, and switch to the g0 stack.
Before:
ok regexp 3.692s
ok compress/bzip2 9.461s
ok encoding/json 6.380s
After:
ok regexp 2.229s (-40%)
ok compress/bzip2 4.703s (-50%)
ok encoding/json 3.629s (-43%)
For comparison, normal non-race build:
ok regexp 0.348s
ok compress/bzip2 0.304s
ok encoding/json 0.661s
Race build:
ok regexp 2.229s (+540%)
ok compress/bzip2 4.703s (+1447%)
ok encoding/json 3.629s (+449%)
Also removes some race-related special cases from cgocall and the scheduler.
In the long term it will allow removing the cyclic runtime/race dependency on cmd/cgo.
Fixes #4249.
Fixes #7460.
Update #6508
Update #6688
R=iant, rsc, bradfitz
CC=golang-codereviews
https://golang.org/cl/55100044
32-bit Windows uses "structured exception handling" (SEH) to
handle hardware faults: there is a per-thread linked list
of fault handlers maintained in user space instead of
something like Unix's signal handlers. The structures in the
linked list are required to live on the OS stack, and the
usual discipline is that the function that pushes a record
(allocated from the current stack frame) onto the list pops
that record before returning. Failing to pop the entry before
returning creates a dangling pointer error: the list head
points to a stack frame that no longer exists.
Go pushes an SEH record in the top frame of every OS thread,
and that record suffices for all Go execution on that thread,
at least until cgo gets involved.
If we call into C using cgo, that called C code may push its
own SEH records, but by convention it must pop them before
returning back to the Go code. We assume it does, and that's
fine.
If the C code calls back into Go, we want the Go SEH handler
to become active again, not whatever C has set up. So
runtime.callbackasm1, which handles a call from C back into
Go, pushes a new SEH record before calling the Go code and
pops it when the Go code returns. That's also fine.
It can happen that when Go calls C calls Go like this, the
inner Go code panics. We allow a defer in the outer Go to
recover the panic, effectively wiping not only the inner Go
frames but also the C calls. This sequence was not popping the
SEH stack up to what it was before the cgo calls, so it was
creating the dangling pointer warned about above. When
eventually the m stack was used enough to overwrite the
dangling SEH records, the SEH chain was lost, and any future
panic would not end up in Go's handler.
The bug in TestCallbackPanic and friends was thus creating a
situation where TestSetPanicOnFault - which causes a hardware
fault - would not find the Go fault handler and instead crash
the binary.
Add checks to TestCallbackPanicLocked to diagnose the mistake
in that test instead of leaving a bad state for another test
case to stumble over.
Fix bug by restoring SEH chain during deferred "endcgo"
cleanup.
This bug is likely present in Go 1.2.1, but since it depends
on Go calling C calling Go, with the inner Go panicking and
the outer Go recovering the panic, it seems not important
enough to bother fixing before Go 1.3. Certainly no one has
complained.
Fixes #7470.
LGTM=alex.brainman
R=golang-codereviews, alex.brainman
CC=golang-codereviews, iant, khr
https://golang.org/cl/71440043
On stack overflow, if all frames on the stack are
copyable, we copy the frames to a new stack twice
as large as the old one. During GC, if a G is using
less than 1/4 of its stack, copy the stack to a stack
half its size.
TODO
- Do something about C frames. When a C frame is in the
stack segment, it isn't copyable. We allocate a new segment
in this case.
- For idempotent C code, we can abort it, copy the stack,
then retry. I'm working on a separate CL for this.
- For other C code, we can raise the stackguard
to the lowest Go frame so the next call that Go frame
makes triggers a copy, which will then succeed.
- Pick a starting stack size?
The plan is that eventually we reach a point where the
stack contains only copyable frames.
LGTM=rsc
R=dvyukov, rsc
CC=golang-codereviews
https://golang.org/cl/54650044
MCaches now hold an MSpan for each size class, which they have
exclusive access to allocate from, so no lock is needed.
Modifying the heap bitmaps also no longer requires a cas.
runtime.free gets more expensive. But we don't use it
much any more.
It's not much faster on 1 processor, but it's a lot
faster on multiple processors.
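Schematically, the uncontended fast path becomes something like this (a sketch with illustrative names, not the runtime's exact code):

	// each MCache owns one MSpan per size class; only the owning M
	// allocates from it, so no lock or cas is needed on this path
	func (c *mcache) allocFast(sizeclass int) unsafe.Pointer {
		s := c.spans[sizeclass]
		v := s.freelist
		if v == nil {
			s = c.refill(sizeclass) // slow path: takes the central list's lock
			v = s.freelist
		}
		s.freelist = v.next
		return unsafe.Pointer(v)
	}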
benchmark old ns/op new ns/op delta
BenchmarkSetTypeNoPtr1 24 23 -0.42%
BenchmarkSetTypeNoPtr2 33 34 +0.89%
BenchmarkSetTypePtr1 51 49 -3.72%
BenchmarkSetTypePtr2 55 54 -1.98%
benchmark old ns/op new ns/op delta
BenchmarkAllocation 52739 50770 -3.73%
BenchmarkAllocation-2 33957 34141 +0.54%
BenchmarkAllocation-3 33326 29015 -12.94%
BenchmarkAllocation-4 38105 25795 -32.31%
BenchmarkAllocation-5 68055 24409 -64.13%
BenchmarkAllocation-6 71544 23488 -67.17%
BenchmarkAllocation-7 68374 23041 -66.30%
BenchmarkAllocation-8 70117 20758 -70.40%
LGTM=rsc, dvyukov
R=dvyukov, bradfitz, khr, rsc
CC=golang-codereviews
https://golang.org/cl/46810043
These are mistakes in the first big NaCl CL.
LGTM=minux.ma, iant
R=golang-codereviews, minux.ma, iant
CC=golang-codereviews
https://golang.org/cl/69200043
See golang.org/s/go13nacl for design overview.
This CL contains the mostly mechanical changes from rsc's Go 1.2-based NaCl branch, specifically 39cb35750369 to 500771b477cf from https://code.google.com/r/rsc-go13nacl. It does not include working NaCl support; there are probably two or three more large merges to come.
CL 15750044 is not included as it involves more invasive changes to the linker which will need to be merged separately.
The exact change lists included are:
15050047: syscall: support for Native Client
15360044: syscall: unzip implementation for Native Client
15370044: syscall: Native Client SRPC implementation
15400047: cmd/dist, cmd/go, go/build, test: support for Native Client
15410048: runtime: support for Native Client
15410049: syscall: file descriptor table for Native Client
15410050: syscall: in-memory file system for Native Client
15440048: all: update +build lines for Native Client port
15540045: cmd/6g, cmd/8g, cmd/gc: support for Native Client
15570045: os: support for Native Client
15680044: crypto/..., hash/crc32, reflect, sync/atomic: support for amd64p32
15690044: net: support for Native Client
15690048: runtime: support for fake time like on Go Playground
15690051: build: disable various tests on Native Client
LGTM=rsc
R=rsc
CC=golang-codereviews
https://golang.org/cl/68150047
SetPanicOnFault allows recovery from unexpected memory faults.
This can be useful if you are using a memory-mapped file
or probing the address space of the current program.
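A minimal use of the API (the fault address below is deliberately invalid; the recovered value is a runtime.Error describing the fault):

	package main

	import (
		"fmt"
		"runtime/debug"
		"unsafe"
	)

	func main() {
		debug.SetPanicOnFault(true)
		defer func() {
			fmt.Println("recovered:", recover())
		}()
		p := (*int)(unsafe.Pointer(uintptr(1))) // not a mapped address
		_ = *p // faults; panics instead of killing the process
	}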
LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/66590044
Package runtime's C functions written to be called from Go
started out written in C using carefully constructed argument
lists and the FLUSH macro to write a result back to memory.
For some functions, the appropriate parameter list ended up
being architecture-dependent due to differences in alignment,
so we added 'goc2c', which takes a .goc file containing Go func
declarations but C bodies, rewrites the Go func declaration to
equivalent C declarations for the target architecture, adds the
needed FLUSH statements, and writes out an equivalent C file.
That C file is compiled as part of package runtime.
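For reference, the rough shape of a .goc file (the function here is a made-up example, not one of the converted ones): a Go func declaration with a C body, which goc2c expands into architecture-correct C.

	package runtime
	#include "runtime.h"

	func goid() (ret int64) {
		ret = g->goid;
	}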
Native Client's x86-64 support introduces the most complex
alignment rules yet, breaking many functions that could until
now be portably written in C. Using goc2c for those avoids the
breakage.
Separately, Keith's work on emitting stack information from
the C compiler would require the hand-written functions
to add #pragmas specifying how many arguments are result
parameters. Using goc2c for those avoids maintaining #pragmas.
For both reasons, use goc2c for as many Go-called C functions
as possible.
This CL is a replay of the bulk of CL 15400047 and CL 15790043,
both of which were reviewed as part of the NaCl port and are
checked in to the NaCl branch. This CL is part of bringing the
NaCl code into the main tree.
No new code here, just reformatting and occasional movement
into .h files.
LGTM=r
R=dave, alex.brainman, r
CC=golang-codereviews
https://golang.org/cl/65220044
Current "System->etext" is not very informative.
Add parent "GC" frame.
Replace un-unwindable syscall/cgo frames with Go stack that leads to the call.
LGTM=rsc
R=rsc, alex.brainman, ality
CC=golang-codereviews
https://golang.org/cl/61270043
Remove the GOOS_solaris ifdef from the netpoll code;
instead introduce a runtime edge/level-triggered IO flag.
Replace armread/armwrite with a single arm(mode) function;
that's how all the other interfaces look, and these functions
need to do roughly the same thing anyway.
LGTM=rsc
R=golang-codereviews, dave, rsc
CC=golang-codereviews
https://golang.org/cl/55500044
1. Make internal chan functions static.
2. Make selgen a local variable instead of a member of the G struct.
3. Change the "bool *pres/selected" parameter of chansend/chanrecv to "bool block",
which is simpler, faster, and less code.
-37 lines total.
LGTM=rsc
R=golang-codereviews, dave, gobot, rsc
CC=bradfitz, golang-codereviews, iant, khr
https://golang.org/cl/58610043
Currently windows crashes because early allocations in schedinit
try to allocate tiny memory blocks, but m->p is not yet set up.
I considered calling procresize(1) earlier in schedinit,
but this refactoring is better and should fix the issue as well.
Fixes #7218.
R=golang-codereviews, r
CC=golang-codereviews
https://golang.org/cl/54570045
Introduce a two-phase goroutine parking mechanism: prepare to park, then commit the park.
This mechanism does not require a backing mutex to protect the wait predicate.
Use it in netpoll; see the comment in netpoll.goc for details.
This slightly reduces contention between readers, writers, and read/write io notifications,
and eliminates a bunch of mutex operations from hot paths, making them faster.
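The pattern can be sketched in user-space Go using sync/atomic (assuming one waiter and one pending notification at a time, with a buffered channel standing in for the scheduler's park/unpark; the runtime parks the goroutine directly):

	const (
		empty   int32 = iota // no waiter prepared, no pending event
		waiting              // phase 1 done: a waiter prepared to park
		ready                // the event fired before a waiter prepared
	)

	type parker struct {
		state int32
		sema  chan struct{} // make(chan struct{}, 1)
	}

	func (p *parker) wait() {
		// phase 1: publish the intent to park
		if !atomic.CompareAndSwapInt32(&p.state, empty, waiting) {
			atomic.StoreInt32(&p.state, empty) // event already fired
			return
		}
		// phase 2: commit the park; if the notifier ran in between,
		// it observed `waiting` and has already queued the wakeup
		<-p.sema
		atomic.StoreInt32(&p.state, empty)
	}

	func (p *parker) notify() {
		if atomic.CompareAndSwapInt32(&p.state, empty, ready) {
			return // no waiter prepared yet; wait() will see `ready`
		}
		p.sema <- struct{}{} // wake the prepared (or parked) waiter
	}

The point is that wait() can evaluate its predicate and park without ever taking a mutex: the CAS on state is the only synchronization.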
benchmark old ns/op new ns/op delta
BenchmarkTCP4ConcurrentReadWrite 2109 1945 -7.78%
BenchmarkTCP4ConcurrentReadWrite-2 1162 1113 -4.22%
BenchmarkTCP4ConcurrentReadWrite-4 798 755 -5.39%
BenchmarkTCP4ConcurrentReadWrite-8 803 748 -6.85%
BenchmarkTCP4Persistent 9411 9240 -1.82%
BenchmarkTCP4Persistent-2 5888 5813 -1.27%
BenchmarkTCP4Persistent-4 4016 3968 -1.20%
BenchmarkTCP4Persistent-8 3943 3857 -2.18%
R=golang-codereviews, mikioh.mikioh, gobot, iant, rsc
CC=golang-codereviews, khr
https://golang.org/cl/45700043
- do not lose profiling signals when we have no mcache (possible for syscalls/cgo)
- do not lose any profiling signals on windows
- fix profiling of cgo programs on windows (they had no m->thread setup)
- properly set up tls in cgo programs on windows
- check the _beginthread return value
Fixes #6417.
Fixes #6986.
R=alex.brainman, rsc
CC=golang-codereviews
https://golang.org/cl/44820047
Currently we collect (add) all roots into a global array in a single-threaded GC phase.
This hinders parallelism.
With this change we just kick off a parallel for over number_of_goroutines+5 iterations.
The parallel-for callback then decides whether it needs to scan the stack of a goroutine,
scan the data segment, scan finalizers, etc. (see the sketch below). This eliminates the single-threaded phase entirely.
This requires storing all goroutines in an array instead of a linked list
(to allow direct indexing).
This CL also removes the DebugScan functionality. It is broken because it uses
an unbounded stack, so it cannot run on g0. When it was working, I found
it unhelpful for debugging because the two algorithms are too different now.
This change would require updating DebugScan, so it is simpler to just delete it.
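Schematically, the callback dispatches on the iteration index (a sketch; the five fixed roots and the helper names are illustrative):

	const (
		rootData = iota
		rootBSS
		rootFinalizers
		rootSpanTypes
		rootFlushCaches
		rootCount // the "+5" fixed iterations before the goroutines
	)

	// invoked concurrently by the parallel-for workers,
	// with i in [0, rootCount+len(allg))
	func markroot(i int) {
		switch i {
		case rootData:
			scanDataSegment()
		case rootBSS:
			scanBSSSegment()
		case rootFinalizers:
			scanFinalizers()
		case rootSpanTypes:
			scanSpanTypes()
		case rootFlushCaches:
			flushAllMCaches()
		default:
			scanstack(allg[i-rootCount]) // direct indexing into the array
		}
	}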
With 8 threads this change reduces GC pause by ~6%, while keeping cputime roughly the same.
garbage-8
allocated 2987886 2989221 +0.04%
allocs 62885 62887 +0.00%
cputime 21286000 21272000 -0.07%
gc-pause-one 26633247 24885421 -6.56%
gc-pause-total 873570 811264 -7.13%
rss 242089984 242515968 +0.18%
sys-gc 13934336 13869056 -0.47%
sys-heap 205062144 205062144 +0.00%
sys-other 12628288 12628288 +0.00%
sys-stack 11534336 11927552 +3.41%
sys-total 243159104 243487040 +0.13%
time 2809477 2740795 -2.44%
R=golang-codereviews, rsc
CC=cshapiro, golang-codereviews, khr
https://golang.org/cl/46860043
Instead of a per-goroutine stack of defers for all sizes,
introduce per-P defer pools for the argument sizes 8, 24, 40, 56, 72 bytes.
For a program that starts 1e6 goroutines and then joins them:
old: rss=6.6g virtmem=10.2g time=4.85s
new: rss=4.5g virtmem= 8.2g time=3.48s
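The size-class mapping is simple rounding over a 16-byte stride (helper names here are illustrative):

	// pools hold total argument sizes 8, 24, 40, 56, 72
	func deferClass(siz uintptr) uintptr {
		return (siz + 7) / 16 // classes 0..4 for siz <= 72
	}

	func pooledSize(class uintptr) uintptr {
		return class*16 + 8 // 8, 24, 40, 56, 72
	}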
R=golang-codereviews, rsc
CC=golang-codereviews
https://golang.org/cl/42750044
Example of output:
goroutine 4 [sleep for 3 min]:
time.Sleep(0x34630b8a000)
src/pkg/runtime/time.goc:31 +0x31
main.func·002()
block.go:16 +0x2c
created by main.main
block.go:17 +0x33
Full program and output are here:
http://play.golang.org/p/NEZdADI3Td
Fixes #6809.
R=golang-codereviews, khr, kamil.kisiel, bradfitz, rsc
CC=golang-codereviews
https://golang.org/cl/50420043
Use a lock-free fixed-size ring for work queues
instead of an unbounded mutex-protected array.
The ring has a single producer and multiple consumers.
If the ring overflows, work is put onto a global queue.
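A sketch of the ring in Go using sync/atomic (illustrative; the runtime's version lives in the C scheduler, and the spill to the global queue is omitted):

	type g struct{} // placeholder for the goroutine descriptor

	// single producer (the owning P), multiple consumers (stealing Ps)
	type workRing struct {
		head uint32 // consumers advance it with cas
		tail uint32 // only the producer advances it
		buf  [256]*g
	}

	func (q *workRing) put(gp *g) bool {
		h := atomic.LoadUint32(&q.head)
		t := q.tail
		if t-h == uint32(len(q.buf)) {
			return false // full: caller spills onto the global queue
		}
		q.buf[t%uint32(len(q.buf))] = gp
		atomic.StoreUint32(&q.tail, t+1) // publish after the slot is written
		return true
	}

	func (q *workRing) get() *g {
		for {
			h := atomic.LoadUint32(&q.head)
			t := atomic.LoadUint32(&q.tail)
			if h == t {
				return nil // empty
			}
			gp := q.buf[h%uint32(len(q.buf))]
			if atomic.CompareAndSwapUint32(&q.head, h, h+1) {
				return gp // won the race for this slot
			}
		}
	}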
benchmark old ns/op new ns/op delta
BenchmarkMatmult 7 5 -18.12%
BenchmarkMatmult-4 2 2 -18.98%
BenchmarkMatmult-16 1 0 -12.84%
BenchmarkCreateGoroutines 105 88 -16.10%
BenchmarkCreateGoroutines-4 376 219 -41.76%
BenchmarkCreateGoroutines-16 241 174 -27.80%
BenchmarkCreateGoroutinesParallel 103 87 -14.66%
BenchmarkCreateGoroutinesParallel-4 169 143 -15.38%
BenchmarkCreateGoroutinesParallel-16 158 151 -4.43%
R=golang-codereviews, rsc
CC=ddetlefs, devon.odell, golang-codereviews
https://golang.org/cl/46170044
Record finalizers and heap profile info per span. This enables
removing the special bit from the heap bitmap, and also
provides a generic mechanism for annotating occasional
heap objects.
finalizers:
      overhead     per obj
old   680 B        80 B avg
new   16 B/span    48 B
profile:
      overhead     per obj
old   32 KB        24 B + hash tables
new   16 B/span    24 B
R=cshapiro, khr, dvyukov, gobot
CC=golang-codereviews
https://golang.org/cl/13314053
When enabled, this new debugging mode allocates objects on
their own pages and never recycles memory addresses. This is an
essential tool for root-causing a broad class of heap corruption.
R=golang-dev, dave, daniel.morsing, dvyukov, rsc, iant, cshapiro
CC=golang-dev
https://golang.org/cl/22060046
This change is part of the plan to get rid of all vararg C calls,
which are a pain for exact stack scanning.
We allocate a chunk of zero memory to return a pointer to when a
map access doesn't find the key. This is simpler than returning nil
and fixing things up in the caller. Linker magic allocates a single
zero memory area that is shared by all (non-reflect-generated) map
types.
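The shape of the mechanism (a sketch; the identifiers are illustrative, and the real zero block is the linker-created symbol mentioned above):

	var zeromem [maxValueSize]byte // zeroed, never written

	// map accesses return a pointer to the value
	func mapaccess(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
		if v := bucketLookup(t, h, key); v != nil {
			return v // points at the value inside the bucket
		}
		return unsafe.Pointer(&zeromem[0]) // missing key: the zero value
	}

	// so v := m[k] compiles to an unconditional copy through the
	// returned pointer, with no nil check in the caller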
Passing things by reference gets rid of some copies, so it speeds
up code with big keys/values.
benchmark old ns/op new ns/op delta
BenchmarkBigKeyMap 34 31 -8.48%
BenchmarkBigValMap 37 30 -18.62%
BenchmarkSmallKeyMap 26 23 -11.28%
R=golang-dev, dvyukov, khr, rsc
CC=golang-dev
https://golang.org/cl/14794043