Currently the execution tracer synchronizes with itself using very
heavyweight operations. As a result, it's totally fine for most of the
tracer code to look like:
if traceEnabled() {
    traceXXX(...)
}
However, if we want to make that synchronization more lightweight (as
issue #60773 proposes), then this is insufficient. In particular, we
need to make sure the tracer can't observe an inconsistency between g
atomicstatus and the event that would be emitted for a particular
g transition. This means making the g status change appear to happen
atomically with the corresponding trace event being written out from the
perspective of the tracer.
This requires a change in API to something more like a lock. While we're
here, we might as well make sure that trace events can *only* be emitted
while this lock is held. This change introduces such an API:
traceAcquire, which returns a value that can emit events, and
traceRelease, which requires the value that was returned by
traceAcquire. In practice, this won't be a real lock, it'll be more like
a seqlock.
For the current tracer, this API is completely overkill and the value
returned by traceAcquire basically just checks trace.enabled. But it's
necessary for the tracer described in #60773 and we can implement that
more cleanly if we do this refactoring now instead of later.
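To make the shape of the new API concrete, here is a minimal, self-contained
sketch of the acquire/release pattern described above. The traceLocker type,
its ok and GoSched methods, and the body of traceAcquire are illustrative
assumptions, not the runtime's actual implementation (which, as noted, will
behave more like a seqlock):

package main

import "fmt"

// traceLocker is a toy stand-in for the value returned by traceAcquire.
// Events can only be emitted through it while it is "held".
type traceLocker struct{ enabled bool }

func (tl traceLocker) ok() bool { return tl.enabled }

// traceAcquire: in the current tracer this amounts to checking trace.enabled;
// a future tracer could make this a seqlock-style acquire instead.
func traceAcquire() traceLocker { return traceLocker{enabled: true} }

// traceRelease requires the value returned by traceAcquire.
func traceRelease(tl traceLocker) { /* a seqlock-style release would go here */ }

// GoSched is a placeholder event emitter hanging off the locker.
func (tl traceLocker) GoSched() { fmt.Println("trace: goroutine scheduled") }

func main() {
	// Call sites change from "if traceEnabled() { traceXXX(...) }" to:
	if tl := traceAcquire(); tl.ok() {
		tl.GoSched()
		traceRelease(tl)
	}
}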
For #60773.
Change-Id: Ibb9ff5958376339fafc2b5180aef65cf2ba18646
Reviewed-on: https://go-review.googlesource.com/c/go/+/515635
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
mspan.freeindex and nelems can fit into uint16 for all possible
values. Use uint16 instead of uintptr.
Change-Id: Ifce20751e81d5022be1f6b5cbb5fbe4fd1728b1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/451359
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
This reverts commit bed2b7cf41.
Reason for revert: I clicked submit by accident on the wrong CL.
Change-Id: Iddf128cb62f289d472510eb30466e515068271b2
Reviewed-on: https://go-review.googlesource.com/c/go/+/449501
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
When the GC is scanning some memory (possibly conservatively) and finds
a pointer while, concurrently, another goroutine is allocating an object
at that same address, the GC may see the pointer before the object
and/or the heap bits are initialized. This may cause the GC to see bad
pointers and possibly crash.
To prevent this, we make it so that the scanner can only see the object
as allocated after the object and the heap bits are initialized. As the
scanner uses freeindex to determine whether an object is allocated, we
delay the increment of freeindex until after the initialization.
Currently, in some code paths, finding the next free index and updating
the free index to a new slot past it are coupled, so this needs a small
refactoring. In the new code, mspan.nextFreeIndex returns the next free
index but does not update it (although allocCache is updated). mallocgc
will update it at a later time.
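A toy model of the resulting ordering, with illustrative names and none of
the runtime's real synchronization, just to show where the freeindex update
moves to:

package sketch

// mspan here is a toy; the real span tracks free slots in allocBits/allocCache.
type mspan struct {
	freeindex uint16
	nelems    uint16
	taken     []bool // stand-in for the allocation bitmap
}

// nextFreeIndex reports the next free slot but no longer advances freeindex.
func (s *mspan) nextFreeIndex() uint16 {
	for i := s.freeindex; i < s.nelems; i++ {
		if !s.taken[i] {
			return i
		}
	}
	return s.nelems
}

// alloc models the mallocgc side: initialize the object (and heap bits) first,
// and only then publish the new freeindex, so a concurrent scanner never sees
// the slot as allocated before it is initialized. The real code also needs the
// publication to be properly ordered (e.g. an atomic store).
func (s *mspan) alloc(initialize func(idx uint16)) uint16 {
	idx := s.nextFreeIndex()
	if idx == s.nelems {
		return idx // span is full
	}
	initialize(idx)
	s.taken[idx] = true
	s.freeindex = idx + 1
	return idx
}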
Fixes #54596.
Change-Id: I6dd5ccf743f2d2c46a1ed67c6a8237fe09a71260
Reviewed-on: https://go-review.googlesource.com/c/go/+/427619
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
This change adds an API to the runtime for arenas. A later CL can
potentially export it as an experimental API, but for now, just the
runtime implementation will suffice.
The purpose of arenas is to improve efficiency, primarily by allowing
for an application to manually free memory, thereby delaying garbage
collection. It comes with other potential performance benefits, such as
better locality, a better allocation strategy, and better handling of
interior pointers by the GC.
This implementation is based on one by danscales@google.com with a few
significant differences:
* The implementation lives entirely in the runtime (all layers).
* Arena chunks are the minimum of 8 MiB or the heap arena size. This
choice is made because in practice 64 MiB appears to be way too large
of an area for most real-world use-cases.
* Arena chunks are not unmapped, instead they're placed on an evacuation
list and when there are no pointers left pointing into them, they're
allowed to be reused.
* Reusing partially-used arena chunks no longer tries to find one used
by the same P first; it just takes the first one available.
* In order to ensure worst-case fragmentation is never worse than 25%,
only types and slice backing stores whose sizes are 1/4th the size of
a chunk or less may be used. Previously larger sizes, up to the size
of the chunk, were allowed.
* ASAN, MSAN, and the race detector are fully supported.
* Arena chunks whose faulting was deferred are set to fault at the end
of mark termination (a non-public patch once did this; I don't see a
reason not to continue that).
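As a rough illustration of the chunk sizing and the 1/4-chunk rule described
above, here is a small sketch; the constant values and names below are
assumptions for illustration, not the runtime's actual definitions:

package sketch

const (
	heapArenaBytes    = 64 << 20 // assumed platform value, for illustration only
	maxUserArenaChunk = 8 << 20  // 8 MiB, per the description above
)

// userArenaChunkBytes is the minimum of 8 MiB and the heap arena size.
func userArenaChunkBytes() uintptr {
	if heapArenaBytes < maxUserArenaChunk {
		return heapArenaBytes
	}
	return maxUserArenaChunk
}

// fitsInUserArena reports whether an allocation of the given size may be
// served from an arena chunk under the 1/4-chunk rule, which bounds
// worst-case fragmentation at 25%.
func fitsInUserArena(size uintptr) bool {
	return size <= userArenaChunkBytes()/4
}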
For #51317.
Change-Id: I83b1693a17302554cb36b6daa4e9249a81b1644f
Reviewed-on: https://go-review.googlesource.com/c/go/+/423359
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
[this is a retry of CL 407035 + its revert CL 422395. The content is unchanged]
Use just 1 bit per word to record the ptr/nonptr bitmap.
Use word-sized operations to manipulate the bitmap, so we can operate
on up to 64 ptr/nonptr bits at a time.
Use a separate bitmap, one bit per word of the ptr/nonptr bitmap,
to encode a no-more-pointers signal. Since we can check 64 ptr/nonptr
bits at once, knowing the exact last pointer location is not necessary.
As a follow-on CL, we should make the gcdata bitmap an array of
uintptr instead of an array of byte, so we can load 64 bits of it at once.
Similarly for the processing of gc programs.
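For intuition, a small sketch of operating on 64 ptr/nonptr bits at once with
word-sized operations (toy code, not the runtime's actual heap bitmap layout):

package main

import (
	"fmt"
	"math/bits"
)

// scanWords visits every word flagged as a pointer in a 64-word window of a
// 1-bit-per-word ptr/nonptr bitmap. A second, much smaller bitmap (not shown)
// would say "no more pointers past here", so the exact last pointer location
// is not needed.
func scanWords(base uintptr, ptrBits uint64) {
	for ptrBits != 0 {
		i := bits.TrailingZeros64(ptrBits) // next pointer-holding word
		fmt.Printf("pointer slot at word %d (addr %#x)\n", i, base+uintptr(i)*8)
		ptrBits &= ptrBits - 1 // clear the lowest set bit
	}
}

func main() {
	// Words 0, 1, and 3 of this window hold pointers.
	scanWords(0x10000, 0b1011)
}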
Change-Id: Ica5eb622f5b87e647be64f471d67b02732ef8be6
Reviewed-on: https://go-review.googlesource.com/c/go/+/422634
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
This reverts commit b589208c8c.
Reason for revert: Bug somewhere in this code, causing wasm and maybe linux/386 to fail.
Change-Id: I5e1e501d839584e0219271bb937e94348f83c11f
Reviewed-on: https://go-review.googlesource.com/c/go/+/422395
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Use just 1 bit per word to record the ptr/nonptr bitmap.
Use word-sized operations to manipulate the bitmap, so we can operate
on up to 64 ptr/nonptr bits at a time.
Use a separate bitmap, one bit per word of the ptr/nonptr bitmap,
to encode a no-more-pointers signal. Since we can check 64 ptr/nonptr
bits at once, knowing the exact last pointer location is not necessary.
This cleans up the bitmap implementation significantly, which will
hopefully make it faster. TODO: measure
As a follow-on CL, we should make the gcdata bitmap an array of
uintptr instead of an array of byte, so we can load 64 bits of it at once.
Similarly for the processing of gc programs.
Change-Id: I18151b1876d9543599800dec51e2a1b19df97d49
Reviewed-on: https://go-review.googlesource.com/c/go/+/407035
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@google.com>
CL 333389 erroneously moved traceGCSweepDone inside the sl.valid block
that it introduced in mcentral.cacheSpan, when it should have left it
outside that scope, because the trace event is created unconditionally
at the top of the method.
Fixes #49231.
Change-Id: If719c05063d4f4d330a320da6dc00d1e9d8102e4
Reviewed-on: https://go-review.googlesource.com/c/go/+/359775
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Currently, the runtime zeroes allocations in several ways. First, small
object spans are always zeroed if they come from mheap, and their slots
are zeroed later in mallocgc if needed. Second, large object spans
(objects that have their own spans) plumb the need for zeroing down into
mheap. Third, large objects that have no pointers have their zeroing
delayed until after preemption is re-enabled, but before returning from
mallocgc.
All of this has two consequences:
1. Spans for small objects that come from mheap are sometimes
unnecessarily zeroed, even if the mallocgc call that created them
doesn't need the object slot to be zeroed.
2. This is all messy and difficult to reason about.
This CL simplifies this code, resolving both (1) and (2). First, it
recognizes that zeroing in mheap is unnecessary for small object spans;
mallocgc and its callees in mcache and mcentral, by design, are *always*
able to deal with non-zeroed spans. They must, for they deal with
recycled spans all the time. Once this fact is made clear, the only
remaining use of zeroing in mheap is for large objects.
As a result, this CL lifts mheap zeroing for large objects into
mallocgc, to parallel all the other codepaths in mallocgc. This makes
the large object allocation code less surprising.
Next, this CL sets the flag for the delayed zeroing explicitly in the one
case where it matters, and inverts and renames the flag from isZeroed to
delayZeroing.
Finally, it adds a check to make sure that only pointer-free allocations
take the delayed zeroing codepath, as an extra safety measure.
Benchmark results: https://perf.golang.org/search?q=upload:20211028.8
Inspired by tapir.liu@gmail.com's CL 343470.
Change-Id: I7e1296adc19ce8a02c8d93a0a5082aefb2673e8f
Reviewed-on: https://go-review.googlesource.com/c/go/+/359477
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: David Chase <drchase@google.com>
Currently, there is a chance that the sweep termination condition could
flap, causing e.g. runtime.GC to return before all sweep work has been
not only drained, but also completed. CL 307915 and CL 307916 attempted
to fix this problem, but it is still possible that mheap_.sweepDrained is
marked before any outstanding sweepers are accounted for in
mheap_.sweepers, leaving a window in which a thread could observe
isSweepDone as true before it actually was (and after some time it would
revert to false, then true again, depending on the number of outstanding
sweepers at that point).
This change fixes the sweep termination condition by merging
mheap_.sweepers and mheap_.sweepDrained into a single atomic value.
This value is updated such that a new potential sweeper will increment
the outstanding sweeper count iff there are still outstanding spans to be
swept without an outstanding sweeper to pick them up. This design
simplifies the sweep termination condition into a single atomic load and
comparison and ensures the condition never flaps.
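A hedged sketch of the single-atomic-value idea; the field and constant names
below are assumptions, not necessarily the runtime's:

package main

import (
	"fmt"
	"sync/atomic"
)

// One word holds both pieces of state: the top bit records "no more spans
// left to pick up" (the old sweepDrained), and the low 31 bits count
// outstanding sweepers (the old sweepers).
const sweepDrainedMask = 1 << 31

var sweepState atomic.Uint32

// beginSweep registers a new sweeper iff there may still be spans to sweep.
func beginSweep() bool {
	for {
		state := sweepState.Load()
		if state&sweepDrainedMask != 0 {
			return false // drained: don't become a sweeper
		}
		if sweepState.CompareAndSwap(state, state+1) {
			return true
		}
	}
}

// isSweepDone is a single load and comparison and cannot flap: it is true
// only when sweeping is drained and no sweepers remain outstanding.
func isSweepDone() bool {
	return sweepState.Load() == sweepDrainedMask
}

func main() {
	fmt.Println(beginSweep(), isSweepDone())
}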
Updates #46500.
Fixes #45315.
Change-Id: I6d69aff156b8d48428c4cc8cfdbf28be346dbf04
Reviewed-on: https://go-review.googlesource.com/c/go/+/333389
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
If something "huge" is allocated and the zeroing is trivial (no pointers
involved), then zero it by chunks in a loop so that preemption can occur,
rather than all in a single non-preemptible call.
Benchmarking suggests that 256K is the best chunk size.
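A sketch of the chunked-zeroing loop, with illustrative names (the runtime
uses its own bulk memclr and preemption points):

package sketch

const zeroChunkBytes = 256 << 10 // 256K, the best chunk size per the benchmarking above

// zeroInChunks zeroes a large, pointer-free buffer one chunk at a time so the
// work is not one long non-preemptible call; between chunks the goroutine can
// be preempted.
func zeroInChunks(b []byte) {
	for len(b) > 0 {
		n := len(b)
		if n > zeroChunkBytes {
			n = zeroChunkBytes
		}
		chunk := b[:n]
		for i := range chunk { // the compiler lowers this loop to a bulk memclr
			chunk[i] = 0
		}
		b = b[n:]
		// In the runtime, a preemption check would sit here.
	}
}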
Updates #42642.
Change-Id: I94015e467eaa098c59870e479d6d83bc88efbfb4
Reviewed-on: https://go-review.googlesource.com/c/go/+/270943
Trust: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
The runtime currently has two different notions of sweep completion:
1. All spans are either swept or have begun sweeping.
2. The sweeper has *finished* sweeping all spans.
Most things depend on condition 1. Notably, GC correctness depends on
condition 1, but since all sweep operations are non-preemptible, the STW
at the beginning of GC forces condition 1 to become condition 2.
runtime.GC(), however, depends on condition 2, since the intent is to
complete a full GC cycle, and also update the heap profile (which
can only be done after sweeping is complete).
However, the way we compute condition 2 is racy right now and may in
fact only indicate condition 1. Specifically, sweepone blocks
condition 2 until all sweepone calls are done, but there are many
other ways to enter the sweeper that don't block this. Hence, sweepone
may see that there are no more spans in the sweep list and see that
it's the last sweepone and declare sweeping done, while there's some
other sweeper still working on a span.
Fix this by making sure every entry to the sweeper participates in the
protocol that blocks condition 2. To make sure we get this right, this
CL introduces a type to track sweep blocking and (lightly) enforces
span sweep ownership via the type system. This has the nice
side-effect of abstracting the pattern of acquiring sweep ownership
that's currently repeated in many different places.
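A minimal sketch of what enforcing span sweep ownership via the type system
can look like; the names below are illustrative, not the runtime's exact API:

package sketch

import "sync/atomic"

type mspan struct{ sweepgen uint32 }

// sweepLocked wraps a span that the holder has exclusive permission to sweep.
// Because it is only produced by tryAcquire, sweep cannot be called on a span
// whose ownership was never taken.
type sweepLocked struct{ s *mspan }

// tryAcquire attempts to take sweep ownership of s for the given generation.
func tryAcquire(s *mspan, sweepgen uint32) (sweepLocked, bool) {
	if !atomic.CompareAndSwapUint32(&s.sweepgen, sweepgen-2, sweepgen-1) {
		return sweepLocked{}, false // someone else owns it, or it is already swept
	}
	return sweepLocked{s}, true
}

// sweep is only available through a sweepLocked value.
func (l sweepLocked) sweep(sweepgen uint32) {
	// ... free unmarked objects ...
	atomic.StoreUint32(&l.s.sweepgen, sweepgen) // mark the span as swept
}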
Fixes #45315.
Change-Id: I7fab30170c5ae14c8b2f10998628735b8be6d901
Reviewed-on: https://go-review.googlesource.com/c/go/+/307915
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
This change removes mcentral.nmalloc and adds mcache.local_nsmallalloc
which fulfills the same role but may be accessed non-atomically. It also
moves responsibility for updating heap_live and local_nsmallalloc into
mcache functions.
As a result of this change, mcache is now the sole source-of-truth for
malloc stats. It is also solely responsible for updating heap_live and
performing the various operations required as a result of updating
heap_live. The overall improvement here is in code organization:
previously malloc stats were fairly scattered, and now they have one
single home, and nearly all the required manipulations exist in a single
file.
Change-Id: I7e93fa297c1debf17e3f2a0d68aeed28a9c6af00
Reviewed-on: https://go-review.googlesource.com/c/go/+/246966
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
This change deletes the old mcentral implementation from the code base
and the newMCentralImpl feature flag along with it.
Updates #37487.
Change-Id: Ibca8f722665f0865051f649ffe699cbdbfdcfcf2
Reviewed-on: https://go-review.googlesource.com/c/go/+/221184
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Currently, the small object sweeper will sweep until it finds a free
slot or there are no more spans of that size class to sweep. In dense
heaps, this can cause sweeping for a given size class to take
unbounded time, and gets worse with larger heaps.
This CL limits the small object sweeper to try at most 100 spans
before giving up and allocating a fresh span. Since it's already shown
that 100 spans are completely full at that point, the space overhead
of this fresh span is at most 1%.
This CL is based on an experimental CL by Austin Clements (CL 187817)
and is updated to be part of the mcentral implementation, gated by
go115NewMCentralImpl.
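The bounded loop described above looks roughly like this (toy scaffolding and
names, not the real mcentral code):

package sketch

type span struct{ freeSlots int }

// Stand-ins for the real partial-swept-span set, the sweeper, and span
// allocation from the heap.
func nextUnsweptPartialSpan() *span { return nil }
func sweepSpan(s *span)             {}
func growFreshSpan() *span          { return &span{freeSlots: 1} }

const spanSweepBudget = 100

// cacheSpan tries at most spanSweepBudget spans; if every one of them turns
// out to be full after sweeping, it gives up and allocates a fresh span,
// bounding both the sweep time and the extra space (at most ~1%).
func cacheSpan() *span {
	for tries := 0; tries < spanSweepBudget; tries++ {
		s := nextUnsweptPartialSpan()
		if s == nil {
			break
		}
		sweepSpan(s)
		if s.freeSlots > 0 {
			return s
		}
	}
	return growFreshSpan()
}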
Updates #18155.
Change-Id: I37a72c2dcc61dd6f802d1d0eac3683e6642b6ef8
Reviewed-on: https://go-review.googlesource.com/c/go/+/229998
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Austin Clements <austin@google.com>
Currently mcentral is implemented as a couple of linked lists of spans
protected by a lock. Unfortunately this design leads to significant lock
contention.
The span ownership model is also confusing and complicated. In-use spans
jump between being owned by multiple sources, generally some combination
of a gcSweepBuf, a concurrent sweeper, an mcentral or an mcache.
So first to address contention, this change replaces those linked lists
with gcSweepBufs which have an atomic fast path. Then, we change up the
ownership model: a span may be simultaneously owned only by an mcentral
and the page reclaimer. Otherwise, an mcentral (which now consists of
sweep bufs), a sweeper, or an mcache is the sole owner of a span at
any given time. This dramatically simplifies reasoning about span
ownership in the runtime.
As a result of this new ownership model, sweeping is now driven by
walking over the mcentrals rather than having its own global list of
spans. Because we no longer have a global list and we traditionally
haven't used the mcentrals for large object spans, we no longer have
anywhere to put large objects. So, this change also makes it so that we
keep large object spans in the appropriate mcentral lists.
In terms of the static lock ranking, we add the spanSet spine locks in
pretty much the same place as the mcentral locks, since they have the
potential to be manipulated both on the allocation and sweep paths, like
the mcentral locks.
This new implementation is turned on by default via a feature flag
called go115NewMCentralImpl.
Benchmark results for 1 KiB allocation throughput (5 runs each):
name \ MiB/s go113 go114 gotip gotip+this-patch
AllocKiB-1 1.71k ± 1% 1.68k ± 1% 1.59k ± 2% 1.71k ± 1%
AllocKiB-2 2.46k ± 1% 2.51k ± 1% 2.54k ± 1% 2.93k ± 1%
AllocKiB-4 4.27k ± 1% 4.41k ± 2% 4.33k ± 1% 5.01k ± 2%
AllocKiB-8 4.38k ± 3% 5.24k ± 1% 5.46k ± 1% 8.23k ± 1%
AllocKiB-12 4.38k ± 3% 4.49k ± 1% 5.10k ± 1% 10.04k ± 0%
AllocKiB-16 4.31k ± 1% 4.14k ± 3% 4.22k ± 0% 10.42k ± 0%
AllocKiB-20 4.26k ± 1% 3.98k ± 1% 4.09k ± 1% 10.46k ± 3%
AllocKiB-24 4.20k ± 1% 3.97k ± 1% 4.06k ± 1% 10.74k ± 1%
AllocKiB-28 4.15k ± 0% 4.00k ± 0% 4.20k ± 0% 10.76k ± 1%
Fixes #37487.
Change-Id: I92d47355acacf9af2c41bf080c08a8c1638ba210
Reviewed-on: https://go-review.googlesource.com/c/go/+/221182
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
I took some of the infrastructure from Austin's lock logging CR
https://go-review.googlesource.com/c/go/+/192704 (with deadlock
detection from the logs), and developed a setup to give static lock
ranking for runtime locks.
Static lock ranking establishes a documented total ordering among locks,
and then reports an error if the total order is violated. This can
happen if a deadlock happens (by acquiring a sequence of locks in
different orders), or if just one side of a possible deadlock happens.
Lock ordering deadlocks cannot happen as long as the lock ordering is
followed.
Along the way, I found a deadlock involving the new timer code, which Ian fixed
via https://go-review.googlesource.com/c/go/+/207348, as well as two other
potential deadlocks.
See the constants at the top of runtime/lockrank.go to show the static
lock ranking that I ended up with, along with some comments. This is
great documentation of the current intended lock ordering when acquiring
multiple locks in the runtime.
I also added an array lockPartialOrder[] which shows and enforces the
current partial ordering among locks (which is embedded within the total
ordering). This is more specific about the dependencies among locks.
I don't try to check the ranking within a lock class with multiple locks
that can be acquired at the same time (i.e. check the ranking when
multiple hchan locks are acquired).
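A toy version of the check, just to show the mechanism; the real ranks and
the partial-order table live in runtime/lockrank.go, and everything below is
illustrative rather than the runtime's implementation:

package sketch

// lockRank is a position in the documented total order; higher ranks must be
// acquired after lower ones. Leaf locks get a very high rank.
type lockRank int

type rankedMutex struct{ rank lockRank }

// lockContext tracks the ranks of locks currently held by one M.
type lockContext struct{ held []lockRank }

// acquire reports an ordering violation if the new lock's rank is not
// strictly greater than every rank already held. (The runtime additionally
// consults a partial-order table, lockPartialOrder, for a finer check.)
func (c *lockContext) acquire(m *rankedMutex) {
	for _, r := range c.held {
		if m.rank <= r {
			panic("lock ordering violation")
		}
	}
	c.held = append(c.held, m.rank)
}

func (c *lockContext) release(m *rankedMutex) {
	for i := len(c.held) - 1; i >= 0; i-- {
		if c.held[i] == m.rank {
			c.held = append(c.held[:i], c.held[i+1:]...)
			return
		}
	}
}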
Currently, I am doing a lockInit() call to set the lock rank of most
locks. Any lock that is not otherwise initialized is assumed to be a
leaf lock (a very high rank lock), so that eliminates the need to do
anything for a bunch of locks (including all architecture-dependent
locks). For two locks, root.lock and notifyList.lock (only in the
runtime/sema.go file), it is not as easy to do lock initialization, so
instead, I am passing the lock rank with the lock calls.
For Windows compilation, I needed to increase the StackGuard size from
896 to 928 because of the new lock-rank checking functions.
Checking of the static lock ranking is enabled by setting
GOEXPERIMENT=staticlockranking before doing a run.
To make sure that the static lock ranking code has no overhead in memory
or CPU when not enabled by GOEXPERIMENT, I changed 'go build/install' so
that it defines a build tag (with the same name) whenever any experiment
has been baked into the toolchain (by checking Expstring()). This allows
me to avoid increasing the size of the 'mutex' type when static lock
ranking is not enabled.
Fixes #38029
Change-Id: I154217ff307c47051f8dae9c2a03b53081acd83a
Reviewed-on: https://go-review.googlesource.com/c/go/+/207619
Reviewed-by: Dan Scales <danscales@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Dan Scales <danscales@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
mheap_.alloc currently accepts both a spanClass and a "large" parameter
indicating whether the allocation is large. These are redundant, since
spanClass.sizeclass() == 0 is an equivalent way to determine this and is
already used in mheap_.alloc. There are no places in the runtime where
the size class could be non-zero and large == true.
Updates #35112.
Change-Id: Ie66facf8f0faca6f4cd3d20a8ac4bc259e11823d
Reviewed-on: https://go-review.googlesource.com/c/go/+/196639
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
This change removes useless additional heap_objects accounting for large
objects. heap_objects is computed from scratch at ReadMemStats time
(which stops the world) by using nlargealloc and nlargefree, so mutating
heap_objects turns out to be pointless.
As a result, the "large" parameter on "mheap_.freeSpan" is no longer
necessary and so this change cleans that up too.
Change-Id: I7d6b486d9b57c018e3db46221d81b55fe4c1b021
Reviewed-on: https://go-review.googlesource.com/c/go/+/196637
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Divisions are generally slow. The compiler can optimize a division
to use a sequence of faster multiplies and shifts (magic constants)
if the divisor is known at compile time.
The value of the span element size in mcentral.grow is not known at
compile time but magic constants to compute n / span.elementsize
are already stored in class_to_divmagic and mspan.
They however need to be adjusted to work for
(0 <= n <= span.npages * pagesize) instead of
(0 <= n < span.npages * pagesize).
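For reference, the general multiply-and-shift trick looks like the following.
This is not the exact class_to_divmagic encoding; the brute-force check merely
demonstrates the identity for one assumed divisor over a span-sized range:

package main

import "fmt"

// divMagic returns a 32.32 fixed-point reciprocal of d.
func divMagic(d uint32) uint64 { return uint64(^uint32(0))/uint64(d) + 1 }

// divide computes n / d as a multiply and a shift, given m = divMagic(d).
func divide(n uint32, m uint64) uint32 { return uint32(uint64(n) * m >> 32) }

func main() {
	const d, limit = 48, 64 << 10 // e.g. 48-byte objects, offsets within a 64 KiB span
	m := divMagic(d)
	for n := uint32(0); n <= limit; n++ {
		if divide(n, m) != n/d {
			fmt.Println("mismatch at", n)
			return
		}
	}
	fmt.Println("multiply-and-shift matches n/d over the whole range")
}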
Change-Id: Ieea59f1c94525a88d012f2557d43691967900deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/148057
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
This change cleans up references to MSpan, MCache, and MCentral in the
docs via a bunch of sed invocations to better reflect the Go names for
the equivalent structures (i.e. mspan, mcache, mcentral) and their
methods (i.e. MSpan_Sweep -> mspan.sweep).
Change-Id: Ie911ac975a24bd25200a273086dd835ab78b1711
Reviewed-on: https://go-review.googlesource.com/c/147557
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Go documentation style for boolean funcs is to say:
// Foo reports whether ...
func Foo() bool
(rather than "returns true if")
This CL also replaces 4 uses of "iff" with the same "reports whether"
wording, which doesn't lose any meaning, and will prevent people from
sending typo fixes when they don't realize it's "if and only if". In
the past I think we've had the typo CLs updated to just say "reports
whether". So do them all at once.
(Inspired by the addition of another "returns true if" in CL 146938
in fd_plan9.go)
Created with:
$ perl -i -npe 's/returns true iff/reports whether/' $(git grep -l "returns true iff" | grep -v vendor)
$ perl -i -npe 's/returns true if/reports whether/' $(git grep -l "returns true if" | grep -v vendor)
Change-Id: Ided502237f5ab0d25cb625dbab12529c361a8b9f
Reviewed-on: https://go-review.googlesource.com/c/147037
Reviewed-by: Ian Lance Taylor <iant@golang.org>
For unclear reasons, cacheSpan and uncacheSpan compute the number of
elements in a span by dividing its size by the element size. This
number is simply available in the mspan structure, so just use it.
Change-Id: If2e5de6ecec39befd3324bf1da4a275ad000932f
Reviewed-on: https://go-review.googlesource.com/c/138656
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Lazy mcache flushing (golang.org/cl/134783) made it so that moving a
span from an mcache to an mcentral was sometimes responsible for
sweeping the span. However, it did a "preserving" sweep, which meant
it retained ownership, even if the sweeper swept all objects in the
span. As a result, we could put a completely unused span back in the
mcentral.
Fix this by first taking back ownership of the span into the mcentral
and moving it to the right mcentral list, and then doing a
non-preserving sweep. The non-preserving sweep will move the span to
the heap if it sweeps all objects.
Change-Id: I244b1893b44b8c00264f0928ac9239449775f617
Reviewed-on: https://go-review.googlesource.com/c/140597
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
freeSpan currently takes a mysterious "acct int32" argument. This is
really just a boolean and actually just needs to match the "large"
argument to alloc in order to balance out accounting.
To make this clearer, replace acct with a "large bool" argument that
must match the call to mheap.alloc.
Change-Id: Ibc81faefdf9f0583114e1953fcfb362e9c3c76de
Reviewed-on: https://go-review.googlesource.com/c/138655
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Currently, all mcaches are flushed during STW mark termination as a
root marking job. This is currently necessary because all spans must
be out of these caches before sweeping begins to avoid races with
allocation and to ensure the spans are in the state expected by
sweeping. We do it as a root marking job because mcache flushing is
somewhat expensive and O(GOMAXPROCS) and this parallelizes the work
across the Ps. However, it's also the last remaining root marking job
performed during mark termination.
This CL moves mcache flushing out of mark termination and performs it
lazily. We keep track of the last sweepgen at which each mcache was
flushed and as each P is woken from STW, it observes that its mcache
is out-of-date and flushes it.
This introduces a complication for spans cached in stale mcaches. These
may now be observed by background or proportional sweeping or when
attempting to add a finalizer, but aren't in a stable state. For
example, they are likely to be on the wrong mcentral list. To fix
this, this CL extends the sweepgen protocol to also capture whether a
span is cached and, if so, whether or not its cache is stale. This
protocol blocks asynchronous sweeping from touching cached spans and
makes it the responsibility of mcache flushing to sweep the flushed
spans.
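The extended protocol amounts to classifying a span's sweepgen relative to
the heap's; the helper below is only a paraphrase for illustration (the
runtime encodes these states in comments and inline checks, not a function
like this):

package sketch

// spanSweepState describes a span whose sweepgen is spanGen while the heap's
// current sweep generation is heapGen (sweepgen advances by 2 per GC cycle).
func spanSweepState(spanGen, heapGen uint32) string {
	switch spanGen {
	case heapGen - 2:
		return "needs sweeping"
	case heapGen - 1:
		return "currently being swept"
	case heapGen:
		return "swept and ready to use"
	case heapGen + 1:
		return "was cached before sweep began, still cached, needs sweeping"
	case heapGen + 3:
		return "was swept and then cached, still cached"
	default:
		return "invalid"
	}
}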
This eliminates the last mark termination root marking job, which
means we can now eliminate that entire infrastructure.
Updates #26903. This implements lazy mcache flushing.
Change-Id: Iadda7aabe540b2026cffc5195da7be37d5b4125e
Reviewed-on: https://go-review.googlesource.com/c/134783
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
This replaces all uses of the mheap_.arena_* fields outside of
mallocinit and sysAlloc. These fields fundamentally assume a
contiguous heap between two bounds, so eliminating these is necessary
for a sparse heap.
Many of these are replaced with checks for non-nil spans at the test
address (which in turn checks for a non-nil entry in the heap arena
array). Some of them are just for debugging and somewhat meaningless
with a sparse heap, so those we just delete.
Updates #10460.
Change-Id: I8345b95ffc610aed694f08f74633b3c63506a41f
Reviewed-on: https://go-review.googlesource.com/85886
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, we mix objects with pointers and objects without pointers
("noscan" objects) together in memory. As a result, for every object
we grey, we have to check that object's heap bits to find out if it's
noscan, which adds to the per-object cost of GC. This also hurts the
TLB footprint of the garbage collector because it decreases the
density of scannable objects at the page level.
This commit improves the situation by using separate spans for noscan
objects. This will allow a much simpler noscan check (in a follow up
CL), eliminate the need to clear the bitmap of noscan objects (in a
follow up CL), and improves TLB footprint by increasing the density of
scannable objects.
This is also a step toward eliminating dead bits, since the current
noscan check depends on checking the dead bit of the first word.
This has no effect on the heap size of the garbage benchmark.
We'll measure the performance change of this after the follow-up
optimizations.
This is a cherry-pick from dev.garbage commit d491e550c3. The only
non-trivial merge conflict was in updatememstats in mstats.go, where
we now have to separate the per-spanclass stats from the per-sizeclass
stats.
Change-Id: I13bdc4869538ece5649a8d2a41c6605371618e40
Reviewed-on: https://go-review.googlesource.com/41251
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, proportional sweep maintains its own count of how many
bytes have been allocated since the beginning of the sweep cycle so it
can compute how many pages need to be swept for a given allocation.
However, this requires a somewhat complex reimbursement scheme since
proportional sweep must be done before a span is allocated, but we
don't know how many bytes to charge until we've allocated a span. This
means that the allocated byte count used by proportional sweep can go
up and down, which has led to underflow bugs in the past (#18043) and
is going to interfere with adjusting sweep pacing on-the-fly (for #19076).
This approach also means we're maintaining a statistic that is very
closely related to heap_live, but has a different 0 value. This is
particularly confusing because the sweep ratio is computed based on
heap_live, so you have to understand that these two statistics are
very closely related.
Replace all of this and compute the sweep debt directly from the
current value of heap_live. To make this work, we simply save the
value of heap_live when the sweep ratio is computed to use as a
"basis" for later computing the sweep debt.
This eliminates the need for reimbursement as well as the code for
maintaining the sweeper's version of the live heap size.
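In other words, the sweep debt is just a function of the current heap_live
and the saved basis, roughly as below (illustrative names, not the runtime's
exact fields):

package sketch

// sweepDebtPages is the number of pages that should have been swept by the
// time the live heap has grown to heapLive: bytes allocated since the sweep
// ratio was computed, times the pages-per-byte sweep ratio. heapLiveBasis is
// the value of heap_live saved when the ratio was computed.
func sweepDebtPages(heapLive, heapLiveBasis uint64, sweepPagesPerByte float64) int64 {
	allocatedSinceBasis := heapLive - heapLiveBasis
	return int64(sweepPagesPerByte * float64(allocatedSinceBasis))
}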
For #19076.
Coincidentally fixes #18043, since this eliminates sweep reimbursement
entirely.
Change-Id: I1f931ddd6e90c901a3972c7506874c899251dc2a
Reviewed-on: https://go-review.googlesource.com/39832
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, each individual span sweep emits an event to the trace. But
sweeps are generally done in loops until some condition is satisfied,
so this tracing is lower-level than anyone really wants and hides the
fact that no other work is being accomplished between adjacent sweep
events. This is also high overhead: enabling tracing significantly
impacts sweep latency.
Replace this by instead tracing around the sweep loops used for
allocation. This is slightly tricky because sweep loops don't
generally know if any sweeping will happen in them. Hence, we make the
tracing lazy by recording in the P that we would like to start tracing
the sweep *if* one happens, and then only closing the sweep event if
we started it.
This does mean we don't get tracing on every sweep path, which are
legion. However, we get much more informative tracing on the paths
that block allocation, which are the paths that matter.
Change-Id: I73e14fbb250acb0c9d92e3648bddaa5e7d7e271c
Reviewed-on: https://go-review.googlesource.com/40810
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Currently ReadMemStats stops the world for ~1.7 ms/GB of heap because
it collects statistics from every single span. For large heaps, this
can be quite costly. This is particularly unfortunate because many
production infrastructures call this function regularly to collect and
report statistics.
Fix this by tracking the necessary cumulative statistics in the
mcaches. ReadMemStats still has to stop the world to stabilize these
statistics, but there are only O(GOMAXPROCS) mcaches to collect
statistics from, so this pause is only 25µs even at GOMAXPROCS=100.
Fixes #13613.
Change-Id: I3c0a4e14833f4760dab675efc1916e73b4c0032a
Reviewed-on: https://go-review.googlesource.com/34937
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
This covers basically all sysAlloc'd, persistentalloc'd, and
fixalloc'd types.
Change-Id: I0487c887c2a0ade5e33d4c4c12d837e97468e66b
Reviewed-on: https://go-review.googlesource.com/30941
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently we always zero objects when we allocate them. We used to
have an optimization that would not zero objects that had not been
allocated since the whole span was last zeroed (either by getting it
from the system or by getting it from the heap, which does a bulk
zero), but this depended on the sweeper clobbering the first two words
of each object. Hence, we lost this optimization when the bitmap
sweeper went away.
Re-introduce this optimization using a different mechanism. Each span
already keeps a flag indicating that it just came from the OS or was
just bulk zeroed by the mheap. We can simply use this flag to know
when we don't need to zero an object. This is slightly less efficient
than the old optimization: if a span gets allocated and partially
used, then GC happens and the span gets returned to the mcentral, then
the span gets re-acquired, the old optimization knew that it only had
to re-zero the objects that had been reclaimed, whereas this
optimization will re-zero everything. However, in this case, you're
already paying for the garbage collection, and you've only wasted one
zeroing of the span, so in practice there seems to be little
difference. (If we did want to revive the full optimization, each span
could keep track of a frontier beyond which all free slots are zeroed.
I prototyped this and it didn't obviously do any better than the much
simpler approach in this commit.)
This significantly improves BinaryTree17, which is allocation-heavy
(and runs first, so most pages are already zeroed), and slightly
improves everything else.
name old time/op new time/op delta
XBenchGarbage-12 2.15ms ± 1% 2.14ms ± 1% -0.80% (p=0.000 n=17+17)
name old time/op new time/op delta
BinaryTree17-12 2.71s ± 1% 2.56s ± 1% -5.73% (p=0.000 n=18+19)
DivconstI64-12 1.70ns ± 1% 1.70ns ± 1% ~ (p=0.562 n=18+18)
DivconstU64-12 1.74ns ± 2% 1.74ns ± 1% ~ (p=0.394 n=20+20)
DivconstI32-12 1.74ns ± 0% 1.74ns ± 0% ~ (all samples are equal)
DivconstU32-12 1.66ns ± 1% 1.66ns ± 0% ~ (p=0.516 n=15+16)
DivconstI16-12 1.84ns ± 0% 1.84ns ± 0% ~ (all samples are equal)
DivconstU16-12 1.82ns ± 0% 1.82ns ± 0% ~ (all samples are equal)
DivconstI8-12 1.79ns ± 0% 1.79ns ± 0% ~ (all samples are equal)
DivconstU8-12 1.60ns ± 0% 1.60ns ± 1% ~ (p=0.603 n=17+19)
Fannkuch11-12 2.11s ± 1% 2.11s ± 0% ~ (p=0.333 n=16+19)
FmtFprintfEmpty-12 45.1ns ± 4% 45.4ns ± 5% ~ (p=0.111 n=20+20)
FmtFprintfString-12 134ns ± 0% 129ns ± 0% -3.45% (p=0.000 n=18+16)
FmtFprintfInt-12 131ns ± 1% 129ns ± 1% -1.54% (p=0.000 n=16+18)
FmtFprintfIntInt-12 205ns ± 2% 203ns ± 0% -0.56% (p=0.014 n=20+18)
FmtFprintfPrefixedInt-12 200ns ± 2% 197ns ± 1% -1.48% (p=0.000 n=20+18)
FmtFprintfFloat-12 256ns ± 1% 256ns ± 0% -0.21% (p=0.008 n=18+20)
FmtManyArgs-12 805ns ± 0% 804ns ± 0% -0.19% (p=0.001 n=18+18)
GobDecode-12 7.21ms ± 1% 7.14ms ± 1% -0.92% (p=0.000 n=19+20)
GobEncode-12 5.88ms ± 1% 5.88ms ± 1% ~ (p=0.641 n=18+19)
Gzip-12 218ms ± 1% 218ms ± 1% ~ (p=0.271 n=19+18)
Gunzip-12 37.1ms ± 0% 36.9ms ± 0% -0.29% (p=0.000 n=18+17)
HTTPClientServer-12 78.1µs ± 2% 77.4µs ± 2% ~ (p=0.070 n=19+19)
JSONEncode-12 15.5ms ± 1% 15.5ms ± 0% ~ (p=0.063 n=20+18)
JSONDecode-12 56.1ms ± 0% 55.4ms ± 1% -1.18% (p=0.000 n=19+18)
Mandelbrot200-12 4.05ms ± 0% 4.06ms ± 0% +0.29% (p=0.001 n=18+18)
GoParse-12 3.28ms ± 1% 3.21ms ± 1% -2.30% (p=0.000 n=20+20)
RegexpMatchEasy0_32-12 69.4ns ± 2% 69.3ns ± 1% ~ (p=0.205 n=18+16)
RegexpMatchEasy0_1K-12 239ns ± 0% 239ns ± 0% ~ (all samples are equal)
RegexpMatchEasy1_32-12 69.4ns ± 1% 69.4ns ± 1% ~ (p=0.620 n=15+18)
RegexpMatchEasy1_1K-12 370ns ± 1% 369ns ± 2% ~ (p=0.088 n=20+20)
RegexpMatchMedium_32-12 108ns ± 0% 108ns ± 0% ~ (all samples are equal)
RegexpMatchMedium_1K-12 33.6µs ± 3% 33.5µs ± 3% ~ (p=0.718 n=20+20)
RegexpMatchHard_32-12 1.68µs ± 1% 1.67µs ± 2% ~ (p=0.316 n=20+20)
RegexpMatchHard_1K-12 50.5µs ± 3% 50.4µs ± 3% ~ (p=0.659 n=20+20)
Revcomp-12 381ms ± 1% 381ms ± 1% ~ (p=0.916 n=19+18)
Template-12 66.5ms ± 1% 65.8ms ± 2% -1.08% (p=0.000 n=20+20)
TimeParse-12 317ns ± 0% 319ns ± 0% +0.48% (p=0.000 n=19+12)
TimeFormat-12 338ns ± 0% 338ns ± 0% ~ (p=0.124 n=19+18)
[Geo mean] 5.99µs 5.96µs -0.54%
Change-Id: I638ffd9d9f178835bbfa499bac20bd7224f1a907
Reviewed-on: https://go-review.googlesource.com/22591
Reviewed-by: Rick Hudson <rlh@golang.org>
These used to be used for the list of newly freed objects, but that's
no longer a thing.
Change-Id: I5a4503137b74ec0eae5372ca271b1aa0b32df074
Reviewed-on: https://go-review.googlesource.com/22557
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Prior to this CL the base of a span was calculated in various
places using shifts or calls to base(). This CL now
always calls base() which has been optimized to calculate the
base of the span when the span is initialized and store that
value in the span structure.
Change-Id: I661f2bfa21e3748a249cdf049ef9062db6e78100
Reviewed-on: https://go-review.googlesource.com/20703
Reviewed-by: Austin Clements <austin@google.com>
Add to each span a 64 bit cache (allocCache) of the allocBits
at freeindex. allocCache is shifted such that the lowest bit
corresponds to the bit at freeindex. allocBits uses a 0 to
indicate an object is free; allocCache, on the other hand,
uses a 1 to indicate an object is free. This facilitates
ctz64 (count trailing zero) which counts the number of 0s
trailing the least significant 1. This is also the index of
the least significant 1.
Each span maintains a freeindex indicating the boundary
between allocated objects and unallocated objects. allocCache
is shifted as freeindex is incremented such that the low bit
in allocCache corresponds to the bit at freeindex in the
allocBits array.
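Concretely, the lookup is a count-trailing-zeros on the cache word, roughly
like the sketch below; math/bits stands in for the hardware ctz that the next
paragraph anticipates, and the refill of allocCache from allocBits is omitted:

package sketch

import "math/bits"

type mspan struct {
	freeindex  uint16
	nelems     uint16
	allocCache uint64 // 1 bits mark free objects, shifted so bit 0 is freeindex
}

// nextFree returns the index of the next free object, or nelems if the cached
// 64-bit window holds no free slot (the real code would then refill allocCache
// from allocBits and continue).
func (s *mspan) nextFree() uint16 {
	tz := bits.TrailingZeros64(s.allocCache)
	if tz == 64 {
		return s.nelems
	}
	return s.freeindex + uint16(tz)
}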
Currently ctz64 is written in Go using a for loop so it is
not very efficient. Use of the hardware instruction will
follow. With this in mind comparisons of the garbage
benchmark are as follows.
1.6 release 2.8 seconds
dev:garbage branch 3.1 seconds.
Profiling shows the go implementation of ctz64 takes up
1% of the total time.
Change-Id: If084ed9c3b1eda9f3c6ab2e794625cb870b8167f
Reviewed-on: https://go-review.googlesource.com/20200
Reviewed-by: Austin Clements <austin@google.com>
This is a renaming of the field ref to the
more appropriate allocCount. The field
holds the number of objects in the span
that are currently allocated. Some throw
strings were adjusted to more accurately
convey the meaning of allocCount.
Change-Id: I10daf44e3e9cc24a10912638c7de3c1984ef8efe
Reviewed-on: https://go-review.googlesource.com/19518
Reviewed-by: Austin Clements <austin@google.com>
Instead of building a freelist from the mark bits generated
by the GC this CL allocates directly from the mark bits.
The approach moves the mark bits from the pointer/no pointer
heap structures into their own per span data structures. The
mark/allocation vectors consist of a single mark bit per
object. Two vectors are maintained, one for allocation and
one for the GC's mark phase. During the GC cycle's sweep
phase the interpretation of the vectors is swapped. The
mark vector becomes the allocation vector and the old
allocation vector is cleared and becomes the mark vector that
the next GC cycle will use.
Marked entries in the allocation vector indicate that the
object is not free. Each allocation vector maintains a boundary
between areas of the span already allocated from and areas
not yet allocated from. As objects are allocated this boundary
is moved until it reaches the end of the span. At this point
further allocations will be done from another span.
Since we no longer sweep a span inspecting each freed object,
the responsibility for maintaining the pointer/scalar bits in
the heapBitMap now falls to the routines doing the actual
allocation.
This CL is functionally complete and ready for performance
tuning.
Change-Id: I336e0fc21eef1066e0b68c7067cc71b9f3d50e04
Reviewed-on: https://go-review.googlesource.com/19470
Reviewed-by: Austin Clements <austin@google.com>
The gcmarkBits is a bit vector used by the GC to mark
reachable objects. Once a GC cycle is complete the gcmarkBits
swap places with the allocBits. allocBits is then used directly
by malloc to locate free objects, thus avoiding the
construction of a linked free list. This CL introduces a set
of helper functions for manipulating gcmarkBits and allocBits
that will be used by later CLs to realize the actual
algorithm. Minimal attempts have been made to optimize these
helper routines.
Change-Id: I55ad6240ca32cd456e8ed4973c6970b3b882dd34
Reviewed-on: https://go-review.googlesource.com/19420
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Rick Hudson <rlh@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
The tree's pretty inconsistent about single space vs double space
after a period in documentation. Make it consistently a single space,
per earlier decisions. This means contributors won't be confused by
misleading precedence.
This CL doesn't use go/doc to parse. It only addresses // comments.
It was generated with:
$ perl -i -npe 's,^(\s*// .+[a-z]\.) +([A-Z]),$1 $2,' $(git grep -l -E '^\s*//(.+\.) +([A-Z])')
$ go test go/doc -update
Change-Id: Iccdb99c37c797ef1f804a94b22ba5ee4b500c4f7
Reviewed-on: https://go-review.googlesource.com/20022
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Dave Day <djd@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Currently, we update memstats.heap_live from mcache.local_cachealloc
whenever we lock the heap (e.g., to obtain a fresh span or to release
an unused span). However, under the right circumstances,
local_cachealloc can accumulate allocations up to the size of
the *entire heap* without flushing them to heap_live. Specifically,
since span allocations from an mcentral don't lock the heap, if a
large number of pages are held in an mcentral and the application
continues to use and free objects of that size class (e.g., the
BinaryTree17 benchmark), local_cachealloc won't be flushed until the
mcentral runs out of spans.
This is a problem because, unlike many of the memory statistics that
are purely informative, heap_live is used to determine when the
garbage collector should start and how hard it should work.
This commit eliminates local_cachealloc, instead atomically updating
heap_live directly. To control contention, we do this only when
obtaining a span from an mcentral. Furthermore, we make heap_live
conservative: allocating a span assumes that all free slots in that
span will be used and accounts for these when the span is
allocated, *before* the objects themselves are. This is important
because 1) this triggers the GC earlier than necessary rather than
potentially too late and 2) this leads to a conservative GC rate
rather than a GC rate that is potentially too low.
Alternatively, we could have flushed local_cachealloc when it passed
some threshold, but this would require determining a threshold and
would cause heap_live to underestimate the true value rather than
overestimate.
Fixes #12199.
name old time/op new time/op delta
BinaryTree17-12 2.88s ± 4% 2.88s ± 1% ~ (p=0.470 n=19+19)
Fannkuch11-12 2.48s ± 1% 2.48s ± 1% ~ (p=0.243 n=16+19)
FmtFprintfEmpty-12 50.9ns ± 2% 50.7ns ± 1% ~ (p=0.238 n=15+14)
FmtFprintfString-12 175ns ± 1% 171ns ± 1% -2.48% (p=0.000 n=18+18)
FmtFprintfInt-12 159ns ± 1% 158ns ± 1% -0.78% (p=0.000 n=19+18)
FmtFprintfIntInt-12 270ns ± 1% 265ns ± 2% -1.67% (p=0.000 n=18+18)
FmtFprintfPrefixedInt-12 235ns ± 1% 234ns ± 0% ~ (p=0.362 n=18+19)
FmtFprintfFloat-12 309ns ± 1% 308ns ± 1% -0.41% (p=0.001 n=18+19)
FmtManyArgs-12 1.10µs ± 1% 1.08µs ± 0% -1.96% (p=0.000 n=19+18)
GobDecode-12 7.81ms ± 1% 7.80ms ± 1% ~ (p=0.425 n=18+19)
GobEncode-12 6.53ms ± 1% 6.53ms ± 1% ~ (p=0.817 n=19+19)
Gzip-12 312ms ± 1% 312ms ± 2% ~ (p=0.967 n=19+20)
Gunzip-12 42.0ms ± 1% 41.9ms ± 1% ~ (p=0.172 n=19+19)
HTTPClientServer-12 63.7µs ± 1% 63.8µs ± 1% ~ (p=0.639 n=19+19)
JSONEncode-12 16.4ms ± 1% 16.4ms ± 1% ~ (p=0.954 n=19+19)
JSONDecode-12 58.5ms ± 1% 57.8ms ± 1% -1.27% (p=0.000 n=18+19)
Mandelbrot200-12 3.86ms ± 1% 3.88ms ± 0% +0.44% (p=0.000 n=18+18)
GoParse-12 3.67ms ± 2% 3.66ms ± 1% -0.52% (p=0.001 n=18+19)
RegexpMatchEasy0_32-12 100ns ± 1% 100ns ± 0% ~ (p=0.257 n=19+18)
RegexpMatchEasy0_1K-12 347ns ± 1% 347ns ± 1% ~ (p=0.527 n=18+18)
RegexpMatchEasy1_32-12 83.7ns ± 2% 83.1ns ± 2% ~ (p=0.096 n=18+19)
RegexpMatchEasy1_1K-12 509ns ± 1% 505ns ± 1% -0.75% (p=0.000 n=18+19)
RegexpMatchMedium_32-12 130ns ± 2% 129ns ± 1% ~ (p=0.962 n=20+20)
RegexpMatchMedium_1K-12 39.5µs ± 2% 39.4µs ± 1% ~ (p=0.376 n=20+19)
RegexpMatchHard_32-12 2.04µs ± 0% 2.04µs ± 1% ~ (p=0.195 n=18+17)
RegexpMatchHard_1K-12 61.4µs ± 1% 61.4µs ± 1% ~ (p=0.885 n=19+19)
Revcomp-12 540ms ± 2% 542ms ± 4% ~ (p=0.552 n=19+17)
Template-12 69.6ms ± 1% 71.2ms ± 1% +2.39% (p=0.000 n=20+20)
TimeParse-12 357ns ± 1% 357ns ± 1% ~ (p=0.883 n=18+20)
TimeFormat-12 379ns ± 1% 362ns ± 1% -4.53% (p=0.000 n=18+19)
[Geo mean] 62.0µs 61.8µs -0.44%
name old time/op new time/op delta
XBenchGarbage-12 5.89ms ± 2% 5.81ms ± 2% -1.41% (p=0.000 n=19+18)
Change-Id: I96b31cca6ae77c30693a891cff3fe663fa2447a0
Reviewed-on: https://go-review.googlesource.com/17748
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
deductSweepCredit expects the size in bytes of the span being
allocated, but mCentral_CacheSpan passes the size of a single object
in the span. As a result, we don't sweep enough on that call and when
mCentral_CacheSpan later calls reimburseSweepCredit, it's very likely
to underflow mheap_.spanBytesAlloc, which causes the next call to
deductSweepCredit to think it owes a huge number of pages and finish
off the whole sweep.
In addition to causing the occasional allocation that triggers the
full sweep to be potentially extremely expensive relative to other
allocations, this can indirectly slow down many other allocations.
deductSweepCredit uses sweepone to sweep spans, which returns
fully-unused spans to the heap, where these spans are freed and
coalesced with neighboring free spans. On the other hand, when
mCentral_CacheSpan sweeps a span, it does so with the intent to
immediately reuse that span and, as a result, will not return the span
to the heap even if it is fully unused. This saves on the cost of
locking the heap, finding a span, and initializing that span. For
example, before this change, with GOMAXPROCS=1 (or the background
sweeper disabled) BinaryTree17 returned roughly 220K spans to the heap
and allocated new spans from the heap roughly 232K times. After this
change, it returns 1.3K spans to the heap and allocates new spans from
the heap 39K times. (With background sweeping these numbers are
effectively unchanged because the background sweeper sweeps almost all
of the spans with sweepone; however, parallel sweeping saves more than
the cost of allocating spans from the heap.)
Fixes #13535.
Fixes #13589.
name old time/op new time/op delta
BinaryTree17-12 3.03s ± 1% 2.86s ± 4% -5.61% (p=0.000 n=18+20)
Fannkuch11-12 2.48s ± 1% 2.49s ± 1% ~ (p=0.060 n=17+20)
FmtFprintfEmpty-12 50.7ns ± 1% 50.9ns ± 1% +0.43% (p=0.025 n=15+16)
FmtFprintfString-12 174ns ± 2% 174ns ± 2% ~ (p=0.539 n=19+20)
FmtFprintfInt-12 158ns ± 1% 158ns ± 1% ~ (p=0.300 n=18+20)
FmtFprintfIntInt-12 269ns ± 2% 269ns ± 2% ~ (p=0.784 n=20+18)
FmtFprintfPrefixedInt-12 233ns ± 1% 234ns ± 1% ~ (p=0.389 n=18+18)
FmtFprintfFloat-12 309ns ± 1% 310ns ± 1% +0.25% (p=0.048 n=18+18)
FmtManyArgs-12 1.10µs ± 1% 1.10µs ± 1% ~ (p=0.259 n=18+19)
GobDecode-12 7.81ms ± 1% 7.72ms ± 1% -1.17% (p=0.000 n=19+19)
GobEncode-12 6.56ms ± 0% 6.55ms ± 1% ~ (p=0.433 n=17+19)
Gzip-12 318ms ± 2% 317ms ± 1% ~ (p=0.578 n=19+18)
Gunzip-12 42.1ms ± 2% 42.0ms ± 0% -0.45% (p=0.007 n=18+16)
HTTPClientServer-12 63.9µs ± 1% 64.0µs ± 1% ~ (p=0.146 n=17+19)
JSONEncode-12 16.4ms ± 1% 16.4ms ± 1% ~ (p=0.271 n=19+19)
JSONDecode-12 58.1ms ± 1% 58.0ms ± 1% ~ (p=0.152 n=18+18)
Mandelbrot200-12 3.85ms ± 0% 3.85ms ± 0% ~ (p=0.126 n=19+18)
GoParse-12 3.71ms ± 1% 3.64ms ± 1% -1.86% (p=0.000 n=20+18)
RegexpMatchEasy0_32-12 100ns ± 2% 100ns ± 1% ~ (p=0.588 n=20+20)
RegexpMatchEasy0_1K-12 346ns ± 1% 347ns ± 1% +0.27% (p=0.014 n=17+20)
RegexpMatchEasy1_32-12 82.9ns ± 3% 83.5ns ± 3% ~ (p=0.096 n=19+20)
RegexpMatchEasy1_1K-12 506ns ± 1% 506ns ± 1% ~ (p=0.530 n=19+19)
RegexpMatchMedium_32-12 129ns ± 2% 129ns ± 1% ~ (p=0.566 n=20+19)
RegexpMatchMedium_1K-12 39.4µs ± 1% 39.4µs ± 1% ~ (p=0.713 n=19+20)
RegexpMatchHard_32-12 2.05µs ± 1% 2.06µs ± 1% +0.36% (p=0.008 n=18+20)
RegexpMatchHard_1K-12 61.6µs ± 1% 61.7µs ± 1% ~ (p=0.286 n=19+20)
Revcomp-12 538ms ± 1% 541ms ± 2% ~ (p=0.081 n=18+19)
Template-12 71.5ms ± 2% 71.6ms ± 1% ~ (p=0.513 n=20+19)
TimeParse-12 357ns ± 1% 357ns ± 1% ~ (p=0.935 n=19+18)
TimeFormat-12 352ns ± 1% 352ns ± 1% ~ (p=0.293 n=19+20)
[Geo mean] 62.0µs 61.9µs -0.21%
name old time/op new time/op delta
XBenchGarbage-12 5.83ms ± 2% 5.86ms ± 3% ~ (p=0.247 n=19+20)
Change-Id: I790bb530adace27ccf25d372f24a11954b88443c
Reviewed-on: https://go-review.googlesource.com/17745
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Applies to types fixAlloc, mCache, mCentral, mHeap, mSpan, and
mSpanList.
Two special cases:
1. mHeap_Scavenge() previously didn't take an *mheap parameter, so it
was specially handled in this CL.
2. mHeap_Free() would have collided with mheap's "free" field, so it's
been renamed to (*mheap).freeSpan to parallel its underlying
(*mheap).freeSpanLocked method.
Change-Id: I325938554cca432c166fe9d9d689af2bbd68de4b
Reviewed-on: https://go-review.googlesource.com/16221
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>