Commit Graph

11 Commits

Author SHA1 Message Date
Cherry Mui 9f252a0462 all: delete ARM64 non-register ABI fallback path
Change-Id: I3996fb31789a1f8559348e059cf371774e548a8d
Reviewed-on: https://go-review.googlesource.com/c/go/+/393875
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2022-03-18 18:26:13 +00:00
Cherry Mui 5a40fab19f [dev.typeparams] runtime, internal/bytealg: port performance-critical functions to register ABI on ARM64
This CL ports a few performance-critical assembly functions to use
register arguments directly. This is similar to CL 308931 and
CL 310184.

Change-Id: I6e30dfff17f76b8578ce8cfd51de21b66610fdb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/324400
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
2021-06-03 16:28:19 +00:00
Jonathan Swinney d5388e23b5 runtime: improve memmove performance on arm64
Replace the memmove implementation for moves of 17 bytes or larger
with an implementation from ARM optimized software. The moves of 16
bytes or fewer are unchanged, but the registers used are updated to
match the rest of the implementation.

This implementation makes use of new optimizations:
 - software pipelined loop for large (>128 byte) moves
 - medium size moves (17..128 bytes) have a new implementation
 - address realignment when src or dst is unaligned
 - preference for aligned src (loads) or dst (stores) depending on CPU

To support preference for aligned loads or aligned stores, a new CPU
flag is added. This flag indicates that the detected micro
architecture performs better with aligned loads. Some tested CPUs did
not exhibit a significant difference and are left with the default
behavior of realigning based on the destination address (stores).

Neoverse N1 (Tested on Graviton 2)
name                               old time/op    new time/op     delta
Memmove/0-4                          1.88ns ± 1%     1.87ns ± 1%   -0.58%  (p=0.020 n=10+10)
Memmove/1-4                          4.40ns ± 0%     4.40ns ± 0%     ~     (all equal)
Memmove/8-4                          3.88ns ± 3%     3.80ns ± 0%   -1.97%  (p=0.001 n=10+9)
Memmove/16-4                         3.90ns ± 3%     3.80ns ± 0%   -2.49%  (p=0.000 n=10+9)
Memmove/32-4                         4.80ns ± 0%     4.40ns ± 0%   -8.33%  (p=0.000 n=9+8)
Memmove/64-4                         5.86ns ± 0%     5.00ns ± 0%  -14.76%  (p=0.000 n=8+8)
Memmove/128-4                        8.46ns ± 0%     8.06ns ± 0%   -4.62%  (p=0.000 n=10+10)
Memmove/256-4                        12.4ns ± 0%     12.2ns ± 0%   -1.61%  (p=0.000 n=10+10)
Memmove/512-4                        19.5ns ± 0%     19.1ns ± 0%   -2.05%  (p=0.000 n=10+10)
Memmove/1024-4                       33.7ns ± 0%     33.5ns ± 0%   -0.59%  (p=0.000 n=10+10)
Memmove/2048-4                       62.1ns ± 0%     59.0ns ± 0%   -4.99%  (p=0.000 n=10+10)
Memmove/4096-4                        117ns ± 1%      110ns ± 0%   -5.66%  (p=0.000 n=10+10)
MemmoveUnalignedDst/64-4             6.41ns ± 0%     5.62ns ± 0%  -12.32%  (p=0.000 n=10+7)
MemmoveUnalignedDst/128-4            9.40ns ± 0%     8.34ns ± 0%  -11.24%  (p=0.000 n=10+10)
MemmoveUnalignedDst/256-4            12.8ns ± 0%     12.8ns ± 0%     ~     (all equal)
MemmoveUnalignedDst/512-4            20.4ns ± 0%     19.7ns ± 0%   -3.43%  (p=0.000 n=9+10)
MemmoveUnalignedDst/1024-4           34.1ns ± 0%     35.1ns ± 0%   +2.93%  (p=0.000 n=9+9)
MemmoveUnalignedDst/2048-4           61.5ns ± 0%     60.4ns ± 0%   -1.77%  (p=0.000 n=10+10)
MemmoveUnalignedDst/4096-4            122ns ± 0%      113ns ± 0%   -7.38%  (p=0.002 n=8+10)
MemmoveUnalignedSrc/64-4             7.25ns ± 1%     6.26ns ± 0%  -13.64%  (p=0.000 n=9+9)
MemmoveUnalignedSrc/128-4            10.5ns ± 0%      9.7ns ± 0%   -7.52%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/256-4            17.1ns ± 0%     17.3ns ± 0%   +1.17%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/512-4            27.0ns ± 0%     27.0ns ± 0%     ~     (all equal)
MemmoveUnalignedSrc/1024-4           46.7ns ± 0%     35.7ns ± 0%  -23.55%  (p=0.000 n=10+9)
MemmoveUnalignedSrc/2048-4           85.2ns ± 0%     61.2ns ± 0%  -28.17%  (p=0.000 n=10+8)
MemmoveUnalignedSrc/4096-4            162ns ± 0%      113ns ± 0%  -30.25%  (p=0.000 n=10+10)

name                               old speed      new speed       delta
Memmove/4096-4                     35.2GB/s ± 0%   37.1GB/s ± 0%   +5.56%  (p=0.000 n=10+9)
MemmoveUnalignedSrc/1024-4         21.9GB/s ± 0%   28.7GB/s ± 0%  +30.90%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/2048-4         24.0GB/s ± 0%   33.5GB/s ± 0%  +39.18%  (p=0.000 n=10+9)
MemmoveUnalignedSrc/4096-4         25.3GB/s ± 0%   36.2GB/s ± 0%  +43.50%  (p=0.000 n=10+7)

Cortex-A72 (Graviton 1)
name                               old time/op    new time/op    delta
Memmove/0-4                          3.06ns ± 3%    3.08ns ± 1%     ~     (p=0.958 n=10+9)
Memmove/1-4                          8.72ns ± 0%    7.85ns ± 0%   -9.98%  (p=0.002 n=8+10)
Memmove/8-4                          8.29ns ± 0%    8.29ns ± 0%     ~     (all equal)
Memmove/16-4                         8.29ns ± 0%    8.29ns ± 0%     ~     (all equal)
Memmove/32-4                         8.19ns ± 2%    8.29ns ± 0%     ~     (p=0.114 n=10+10)
Memmove/64-4                         18.3ns ± 4%    10.0ns ± 0%  -45.36%  (p=0.000 n=10+10)
Memmove/128-4                        14.8ns ± 0%    17.4ns ± 0%  +17.77%  (p=0.000 n=10+10)
Memmove/256-4                        21.8ns ± 0%    23.1ns ± 0%   +5.96%  (p=0.000 n=10+10)
Memmove/512-4                        35.8ns ± 0%    37.2ns ± 0%   +3.91%  (p=0.000 n=10+10)
Memmove/1024-4                       63.7ns ± 0%    67.2ns ± 0%   +5.49%  (p=0.000 n=10+10)
Memmove/2048-4                        126ns ± 0%     123ns ± 0%   -2.38%  (p=0.000 n=10+10)
Memmove/4096-4                        238ns ± 1%     243ns ± 1%   +1.93%  (p=0.000 n=10+10)
MemmoveUnalignedDst/64-4             19.3ns ± 1%    12.0ns ± 1%  -37.49%  (p=0.000 n=10+10)
MemmoveUnalignedDst/128-4            17.2ns ± 0%    17.4ns ± 0%   +1.16%  (p=0.000 n=10+10)
MemmoveUnalignedDst/256-4            28.2ns ± 8%    29.2ns ± 0%     ~     (p=0.352 n=10+10)
MemmoveUnalignedDst/512-4            49.8ns ± 3%    48.9ns ± 0%     ~     (p=1.000 n=10+10)
MemmoveUnalignedDst/1024-4           89.5ns ± 0%    80.5ns ± 1%  -10.02%  (p=0.000 n=10+10)
MemmoveUnalignedDst/2048-4            180ns ± 0%     127ns ± 0%  -29.44%  (p=0.000 n=9+10)
MemmoveUnalignedDst/4096-4            347ns ± 0%     244ns ± 0%  -29.59%  (p=0.000 n=10+9)
MemmoveUnalignedSrc/128-4            16.1ns ± 0%    21.8ns ± 0%  +35.40%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/256-4            24.9ns ± 8%    26.6ns ± 0%   +6.70%  (p=0.015 n=10+10)
MemmoveUnalignedSrc/512-4            39.4ns ± 6%    40.6ns ± 0%     ~     (p=0.352 n=10+10)
MemmoveUnalignedSrc/1024-4           72.5ns ± 0%    83.0ns ± 1%  +14.44%  (p=0.000 n=9+10)
MemmoveUnalignedSrc/2048-4            129ns ± 1%     128ns ± 1%     ~     (p=0.179 n=10+10)
MemmoveUnalignedSrc/4096-4            241ns ± 0%     253ns ± 1%   +4.99%  (p=0.000 n=9+9)

Cortex-A53 (Raspberry Pi 3)
name                               old time/op    new time/op    delta
Memmove/0-4                          11.0ns ± 0%    11.0ns ± 1%     ~     (p=0.294 n=8+10)
Memmove/1-4                          29.6ns ± 0%    28.0ns ± 1%   -5.41%  (p=0.000 n=9+10)
Memmove/8-4                          23.5ns ± 0%    22.1ns ± 0%   -6.11%  (p=0.000 n=8+8)
Memmove/16-4                         23.7ns ± 1%    22.1ns ± 0%   -6.59%  (p=0.000 n=10+8)
Memmove/32-4                         27.9ns ± 0%    27.1ns ± 0%   -3.13%  (p=0.000 n=8+8)
Memmove/64-4                         33.8ns ± 0%    31.5ns ± 1%   -6.99%  (p=0.000 n=8+10)
Memmove/128-4                        45.6ns ± 0%    44.2ns ± 1%   -3.23%  (p=0.000 n=9+10)
Memmove/256-4                        69.3ns ± 0%    69.3ns ± 0%     ~     (p=0.072 n=8+8)
Memmove/512-4                         127ns ± 0%     110ns ± 0%  -13.39%  (p=0.000 n=8+8)
Memmove/1024-4                        222ns ± 0%     205ns ± 1%   -7.66%  (p=0.000 n=7+10)
Memmove/2048-4                        411ns ± 0%     366ns ± 0%  -10.98%  (p=0.000 n=8+9)
Memmove/4096-4                        795ns ± 1%     695ns ± 1%  -12.63%  (p=0.000 n=10+10)
MemmoveUnalignedDst/64-4             44.0ns ± 0%    40.5ns ± 0%   -7.93%  (p=0.000 n=8+8)
MemmoveUnalignedDst/128-4            59.6ns ± 0%    54.9ns ± 0%   -7.85%  (p=0.000 n=9+9)
MemmoveUnalignedDst/256-4            98.2ns ±11%    90.0ns ± 1%     ~     (p=0.130 n=10+10)
MemmoveUnalignedDst/512-4             161ns ± 2%     145ns ± 1%   -9.96%  (p=0.000 n=10+10)
MemmoveUnalignedDst/1024-4            281ns ± 0%     265ns ± 0%   -5.65%  (p=0.000 n=9+8)
MemmoveUnalignedDst/2048-4            528ns ± 0%     482ns ± 0%   -8.73%  (p=0.000 n=8+9)
MemmoveUnalignedDst/4096-4           1.02µs ± 1%    0.92µs ± 0%  -10.00%  (p=0.000 n=10+8)
MemmoveUnalignedSrc/64-4             42.4ns ± 1%    40.5ns ± 0%   -4.39%  (p=0.000 n=10+8)
MemmoveUnalignedSrc/128-4            57.4ns ± 0%    57.0ns ± 1%   -0.75%  (p=0.048 n=9+10)
MemmoveUnalignedSrc/256-4            88.1ns ± 1%    89.6ns ± 0%   +1.70%  (p=0.000 n=9+8)
MemmoveUnalignedSrc/512-4             160ns ± 2%     144ns ± 0%   -9.89%  (p=0.000 n=10+8)
MemmoveUnalignedSrc/1024-4            286ns ± 0%     266ns ± 1%   -6.69%  (p=0.000 n=8+10)
MemmoveUnalignedSrc/2048-4            525ns ± 0%     483ns ± 1%   -7.96%  (p=0.000 n=9+10)
MemmoveUnalignedSrc/4096-4           1.01µs ± 0%    0.92µs ± 1%   -9.40%  (p=0.000 n=8+10)

Change-Id: Ia1144e9d4dfafdece6e167c5e576bf80f254c8ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/243357
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: eric fang <eric.fang@arm.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-11-02 15:23:43 +00:00
Austin Clements 9b5bd30716 runtime: document special memmove requirements
Unlike C's memmove, Go's memmove must be careful to do indivisible
writes of pointer values because it may be racing with the garbage
collector reading the heap.

We've had various bugs related to this over the years (#36101, #13160,
 #12552). Indeed, memmove is a great target for optimization and it's
easy to forget the special requirements of Go's memmove.

The CL documents these (currently unwritten!) requirements. We're also
adding a test that should hopefully keep everyone honest going
forward, though it's hard to be sure we're hitting all cases of
memmove.

Change-Id: I2f59f8d8d6fb42d2f10006b55d605b5efd8ddc24
Reviewed-on: https://go-review.googlesource.com/c/go/+/213418
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-01-22 18:54:48 +00:00
Cherry Zhang ffbc02761a runtime: ensure memmove write pointer atomically on ARM64
If a pointer write is not atomic, if the GC is running
concurrently, it may observe a partially updated pointer, which
may point to unallocated or already dead memory. Most pointer
writes, like the store instructions generated by the compiler,
are already atomic. But we still need to be careful in places
like memmove. In memmove, we don't know which bits are pointers
(or too expensive to query), so we ensure that all aligned
pointer-sized units are written atomically.

Fixes #36101.

Change-Id: I1b3ca24c6b1ac8a8aaf9ee470115e9a89ec1b00b
Reviewed-on: https://go-review.googlesource.com/c/go/+/212626
Reviewed-by: Austin Clements <austin@google.com>
2020-01-02 21:41:13 +00:00
Russ Cox 26a5f6a32e runtime: fix scattered non-tab indentation in assembly
Change-Id: I6940a4c747f2da871263afa6a4e3386395d5cf54
Reviewed-on: https://go-review.googlesource.com/c/go/+/180839
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2019-06-06 00:12:04 +00:00
Michael Munday f2cde55cd6 runtime: use Go function signatures for memclr and memmove comments
The function signatures in the comments used a C-like style. Using
Go function signatures is cleaner.

Change-Id: I1a093ed8fe5df59f3697c613cf3fce58bba4f5c1
Reviewed-on: https://go-review.googlesource.com/113876
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2018-05-21 13:18:16 +00:00
Balaram Makam 213a75171d runtime: improve arm64 memmove implementation
Improve runtime memmove_arm64.s specializing for small copies and
processing 32 bytes per iteration for 32 bytes or more.

Benchmark results of runtime/Memmove on Amberwing:
name                      old time/op    new time/op     delta
Memmove/0                   7.61ns ± 0%     7.20ns ± 0%     ~     (p=0.053 n=5+7)
Memmove/1                   9.28ns ± 0%     8.80ns ± 0%   -5.17%  (p=0.000 n=4+8)
Memmove/2                   9.65ns ± 0%     9.20ns ± 0%   -4.68%  (p=0.000 n=5+8)
Memmove/3                   10.0ns ± 0%      9.2ns ± 0%   -7.83%  (p=0.000 n=5+8)
Memmove/4                   10.6ns ± 0%      9.2ns ± 0%  -13.21%  (p=0.000 n=5+8)
Memmove/5                   11.0ns ± 0%      9.2ns ± 0%  -16.36%  (p=0.000 n=5+8)
Memmove/6                   12.4ns ± 0%      9.2ns ± 0%  -25.81%  (p=0.000 n=5+8)
Memmove/7                   13.1ns ± 0%      9.2ns ± 0%  -29.56%  (p=0.000 n=5+8)
Memmove/8                   9.10ns ± 1%     9.20ns ± 0%   +1.08%  (p=0.002 n=5+8)
Memmove/9                   9.67ns ± 0%     9.20ns ± 0%   -4.88%  (p=0.000 n=5+8)
Memmove/10                  10.4ns ± 0%      9.2ns ± 0%  -11.54%  (p=0.000 n=5+8)
Memmove/11                  10.9ns ± 0%      9.2ns ± 0%  -15.60%  (p=0.000 n=5+8)
Memmove/12                  11.5ns ± 0%      9.2ns ± 0%  -20.00%  (p=0.000 n=5+8)
Memmove/13                  12.4ns ± 0%      9.2ns ± 0%  -25.81%  (p=0.000 n=5+8)
Memmove/14                  13.1ns ± 0%      9.2ns ± 0%  -29.77%  (p=0.000 n=5+8)
Memmove/15                  13.8ns ± 0%      9.2ns ± 0%  -33.33%  (p=0.000 n=5+8)
Memmove/16                  9.70ns ± 0%     9.20ns ± 0%   -5.19%  (p=0.000 n=5+8)
Memmove/32                  10.6ns ± 0%      9.2ns ± 0%  -13.21%  (p=0.000 n=4+8)
Memmove/64                  13.4ns ± 0%     10.2ns ± 0%  -23.88%  (p=0.000 n=4+8)
Memmove/128                 18.1ns ± 1%     13.2ns ± 0%  -26.99%  (p=0.000 n=5+8)
Memmove/256                 25.2ns ± 0%     16.4ns ± 0%  -34.92%  (p=0.000 n=5+8)
Memmove/512                 36.4ns ± 0%     22.8ns ± 0%  -37.36%  (p=0.000 n=5+8)
Memmove/1024                70.1ns ± 0%     36.8ns ±11%  -47.49%  (p=0.002 n=5+8)
Memmove/2048                 121ns ± 0%       61ns ± 0%     ~     (p=0.053 n=5+7)
Memmove/4096                 224ns ± 0%      120ns ± 0%  -46.43%  (p=0.000 n=5+8)
MemmoveUnalignedDst/0       8.40ns ± 0%     8.00ns ± 0%   -4.76%  (p=0.000 n=5+8)
MemmoveUnalignedDst/1       9.87ns ± 1%    10.00ns ± 0%     ~     (p=0.070 n=5+8)
MemmoveUnalignedDst/2       10.6ns ± 0%     10.4ns ± 0%   -1.89%  (p=0.000 n=5+8)
MemmoveUnalignedDst/3       10.8ns ± 0%     10.4ns ± 0%   -3.70%  (p=0.000 n=5+8)
MemmoveUnalignedDst/4       10.9ns ± 0%     10.3ns ± 0%     ~     (p=0.053 n=5+7)
MemmoveUnalignedDst/5       11.5ns ± 0%     10.3ns ± 1%  -10.22%  (p=0.000 n=4+8)
MemmoveUnalignedDst/6       13.2ns ± 0%     10.4ns ± 1%  -21.50%  (p=0.000 n=5+8)
MemmoveUnalignedDst/7       13.7ns ± 0%     10.3ns ± 1%  -24.64%  (p=0.000 n=4+8)
MemmoveUnalignedDst/8       10.1ns ± 0%     10.4ns ± 0%   +2.97%  (p=0.002 n=5+8)
MemmoveUnalignedDst/9       10.7ns ± 0%     10.4ns ± 0%   -2.80%  (p=0.000 n=5+8)
MemmoveUnalignedDst/10      11.2ns ± 1%     10.4ns ± 0%   -6.81%  (p=0.000 n=5+8)
MemmoveUnalignedDst/11      11.6ns ± 0%     10.4ns ± 0%  -10.34%  (p=0.000 n=5+8)
MemmoveUnalignedDst/12      12.5ns ± 2%     10.4ns ± 0%  -16.53%  (p=0.000 n=5+8)
MemmoveUnalignedDst/13      13.7ns ± 0%     10.4ns ± 0%  -24.09%  (p=0.000 n=5+8)
MemmoveUnalignedDst/14      14.0ns ± 0%     10.4ns ± 0%  -25.71%  (p=0.000 n=5+8)
MemmoveUnalignedDst/15      14.6ns ± 0%     10.4ns ± 0%  -28.77%  (p=0.000 n=5+8)
MemmoveUnalignedDst/16      10.5ns ± 0%     10.4ns ± 0%   -0.95%  (p=0.000 n=5+8)
MemmoveUnalignedDst/32      12.4ns ± 0%     11.6ns ± 0%   -6.05%  (p=0.000 n=5+8)
MemmoveUnalignedDst/64      15.2ns ± 0%     12.3ns ± 0%  -19.08%  (p=0.000 n=5+8)
MemmoveUnalignedDst/128     18.7ns ± 0%     15.2ns ± 0%  -18.72%  (p=0.000 n=5+8)
MemmoveUnalignedDst/256     25.1ns ± 0%     18.6ns ± 0%  -25.90%  (p=0.000 n=5+8)
MemmoveUnalignedDst/512     37.8ns ± 0%     24.4ns ± 0%  -35.45%  (p=0.000 n=5+8)
MemmoveUnalignedDst/1024    74.6ns ± 0%     40.4ns ± 0%     ~     (p=0.053 n=5+7)
MemmoveUnalignedDst/2048     133ns ± 0%       75ns ± 0%  -43.91%  (p=0.000 n=5+8)
MemmoveUnalignedDst/4096     247ns ± 0%      141ns ± 0%  -42.91%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/0       8.40ns ± 0%     8.00ns ± 0%   -4.76%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/1       9.81ns ± 0%    10.00ns ± 0%   +1.98%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/2       10.5ns ± 0%     10.0ns ± 0%   -4.76%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/3       10.7ns ± 1%     10.0ns ± 0%   -6.89%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/4       11.3ns ± 0%     10.0ns ± 0%  -11.50%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/5       11.6ns ± 0%     10.0ns ± 0%  -13.79%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/6       13.6ns ± 0%     10.0ns ± 0%  -26.47%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/7       14.4ns ± 0%     10.0ns ± 0%  -30.75%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/8       9.87ns ± 1%    10.00ns ± 0%     ~     (p=0.070 n=5+8)
MemmoveUnalignedSrc/9       10.4ns ± 0%     10.0ns ± 0%   -3.85%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/10      11.2ns ± 0%     10.0ns ± 0%  -10.71%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/11      11.8ns ± 0%     10.0ns ± 0%  -15.25%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/12      12.1ns ± 0%     10.0ns ± 0%  -17.36%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/13      13.6ns ± 0%     10.0ns ± 0%  -26.47%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/14      14.7ns ± 0%     10.0ns ± 0%  -31.79%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/15      14.4ns ± 0%     10.0ns ± 0%  -30.56%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/16      11.0ns ± 0%     10.0ns ± 0%   -9.09%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/32      11.5ns ± 0%     10.0ns ± 0%  -13.04%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/64      14.9ns ± 0%     11.2ns ± 0%  -24.83%  (p=0.000 n=4+8)
MemmoveUnalignedSrc/128     19.5ns ± 0%     15.2ns ± 0%  -22.05%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/256     27.3ns ± 2%     19.2ns ± 0%  -29.62%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/512     40.4ns ± 0%     27.2ns ± 0%  -32.67%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/1024    75.4ns ± 0%     44.4ns ± 0%  -41.15%  (p=0.000 n=5+8)
MemmoveUnalignedSrc/2048     131ns ± 0%       77ns ± 3%  -41.56%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/4096     248ns ± 0%      145ns ± 0%  -41.53%  (p=0.000 n=5+8)

name                      old speed      new speed       delta
Memmove/1                  108MB/s ± 0%    114MB/s ± 0%   +5.37%  (p=0.004 n=4+8)
Memmove/2                  207MB/s ± 0%    217MB/s ± 0%   +4.85%  (p=0.002 n=5+8)
Memmove/3                  301MB/s ± 0%    326MB/s ± 0%   +8.45%  (p=0.002 n=5+8)
Memmove/4                  377MB/s ± 0%    435MB/s ± 0%  +15.31%  (p=0.004 n=4+8)
Memmove/5                  455MB/s ± 0%    543MB/s ± 0%  +19.46%  (p=0.002 n=5+8)
Memmove/6                  483MB/s ± 0%    652MB/s ± 0%  +34.88%  (p=0.003 n=5+7)
Memmove/7                  537MB/s ± 0%    761MB/s ± 0%  +41.71%  (p=0.002 n=5+8)
Memmove/8                  879MB/s ± 1%    869MB/s ± 0%   -1.15%  (p=0.000 n=5+7)
Memmove/9                  931MB/s ± 0%    978MB/s ± 0%   +5.05%  (p=0.002 n=5+8)
Memmove/10                 960MB/s ± 0%   1086MB/s ± 0%  +13.13%  (p=0.002 n=5+8)
Memmove/11                1.00GB/s ± 0%   1.20GB/s ± 0%  +18.92%  (p=0.003 n=5+7)
Memmove/12                1.04GB/s ± 0%   1.30GB/s ± 0%  +25.40%  (p=0.002 n=5+8)
Memmove/13                1.05GB/s ± 0%   1.41GB/s ± 0%  +34.87%  (p=0.002 n=5+8)
Memmove/14                1.07GB/s ± 0%   1.52GB/s ± 0%  +42.14%  (p=0.002 n=5+8)
Memmove/15                1.09GB/s ± 0%   1.63GB/s ± 0%  +49.91%  (p=0.002 n=5+8)
Memmove/16                1.65GB/s ± 0%   1.74GB/s ± 0%   +5.40%  (p=0.003 n=5+7)
Memmove/32                3.01GB/s ± 0%   3.48GB/s ± 0%  +15.58%  (p=0.003 n=5+7)
Memmove/64                4.76GB/s ± 0%   6.27GB/s ± 0%  +31.75%  (p=0.003 n=5+7)
Memmove/128               7.08GB/s ± 1%   9.69GB/s ± 0%  +36.96%  (p=0.002 n=5+8)
Memmove/256               10.2GB/s ± 0%   15.6GB/s ± 0%  +53.58%  (p=0.002 n=5+8)
Memmove/512               14.1GB/s ± 0%   22.4GB/s ± 0%  +59.57%  (p=0.003 n=5+7)
Memmove/1024              14.6GB/s ± 0%   27.9GB/s ±10%  +91.00%  (p=0.002 n=5+8)
Memmove/2048              16.9GB/s ± 0%   33.4GB/s ± 0%  +98.32%  (p=0.003 n=5+7)
Memmove/4096              18.3GB/s ± 0%   33.9GB/s ± 0%  +85.80%  (p=0.002 n=5+8)
MemmoveUnalignedDst/1      101MB/s ± 1%    100MB/s ± 0%     ~     (p=0.586 n=5+8)
MemmoveUnalignedDst/2      189MB/s ± 0%    192MB/s ± 0%   +1.82%  (p=0.002 n=5+8)
MemmoveUnalignedDst/3      278MB/s ± 0%    288MB/s ± 0%   +3.88%  (p=0.003 n=5+7)
MemmoveUnalignedDst/4      368MB/s ± 0%    387MB/s ± 0%   +5.41%  (p=0.003 n=5+7)
MemmoveUnalignedDst/5      434MB/s ± 0%    484MB/s ± 0%  +11.52%  (p=0.002 n=5+8)
MemmoveUnalignedDst/6      454MB/s ± 0%    580MB/s ± 0%  +27.62%  (p=0.002 n=5+8)
MemmoveUnalignedDst/7      509MB/s ± 0%    677MB/s ± 0%  +33.01%  (p=0.002 n=5+8)
MemmoveUnalignedDst/8      792MB/s ± 0%    770MB/s ± 0%   -2.77%  (p=0.002 n=5+8)
MemmoveUnalignedDst/9      841MB/s ± 0%    866MB/s ± 0%   +2.92%  (p=0.002 n=5+8)
MemmoveUnalignedDst/10     896MB/s ± 0%    962MB/s ± 0%   +7.35%  (p=0.003 n=5+7)
MemmoveUnalignedDst/11     947MB/s ± 0%   1058MB/s ± 0%  +11.80%  (p=0.002 n=5+8)
MemmoveUnalignedDst/12     962MB/s ± 2%   1154MB/s ± 0%  +19.97%  (p=0.002 n=5+8)
MemmoveUnalignedDst/13     947MB/s ± 0%   1251MB/s ± 0%  +32.08%  (p=0.002 n=5+8)
MemmoveUnalignedDst/14    1.00GB/s ± 0%   1.35GB/s ± 0%  +34.55%  (p=0.002 n=5+8)
MemmoveUnalignedDst/15    1.03GB/s ± 0%   1.44GB/s ± 0%  +40.50%  (p=0.002 n=5+8)
MemmoveUnalignedDst/16    1.53GB/s ± 0%   1.54GB/s ± 0%   +0.77%  (p=0.002 n=5+8)
MemmoveUnalignedDst/32    2.58GB/s ± 0%   2.75GB/s ± 0%   +6.52%  (p=0.003 n=5+7)
MemmoveUnalignedDst/64    4.21GB/s ± 0%   5.19GB/s ± 0%  +23.40%  (p=0.004 n=5+6)
MemmoveUnalignedDst/128   6.86GB/s ± 0%   8.42GB/s ± 0%  +22.78%  (p=0.003 n=5+7)
MemmoveUnalignedDst/256   10.2GB/s ± 0%   13.8GB/s ± 0%  +35.15%  (p=0.002 n=5+8)
MemmoveUnalignedDst/512   13.5GB/s ± 0%   21.0GB/s ± 0%  +54.90%  (p=0.002 n=5+8)
MemmoveUnalignedDst/1024  13.7GB/s ± 0%   25.3GB/s ± 0%  +84.61%  (p=0.003 n=5+7)
MemmoveUnalignedDst/2048  15.3GB/s ± 0%   27.5GB/s ± 0%  +79.52%  (p=0.002 n=5+8)
MemmoveUnalignedDst/4096  16.5GB/s ± 0%   28.9GB/s ± 0%  +74.74%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/1      102MB/s ± 0%    100MB/s ± 0%   -2.02%  (p=0.000 n=5+7)
MemmoveUnalignedSrc/2      191MB/s ± 0%    200MB/s ± 0%   +4.78%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/3      279MB/s ± 0%    300MB/s ± 0%   +7.45%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/4      354MB/s ± 0%    400MB/s ± 0%  +13.10%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/5      431MB/s ± 0%    500MB/s ± 0%  +16.02%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/6      441MB/s ± 0%    600MB/s ± 0%  +36.03%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/7      485MB/s ± 0%    700MB/s ± 0%  +44.29%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/8      811MB/s ± 1%    800MB/s ± 0%   -1.36%  (p=0.016 n=5+8)
MemmoveUnalignedSrc/9      864MB/s ± 0%    900MB/s ± 0%   +4.07%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/10     893MB/s ± 0%    999MB/s ± 0%  +11.97%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/11     932MB/s ± 0%   1099MB/s ± 0%  +18.01%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/12     988MB/s ± 0%   1199MB/s ± 0%  +21.35%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/13     955MB/s ± 0%   1299MB/s ± 0%  +36.02%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/14     955MB/s ± 0%   1399MB/s ± 0%  +46.52%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/15    1.04GB/s ± 0%   1.50GB/s ± 0%  +44.18%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/16    1.45GB/s ± 0%   1.60GB/s ± 0%  +10.14%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/32    2.78GB/s ± 0%   3.20GB/s ± 0%  +15.16%  (p=0.003 n=5+7)
MemmoveUnalignedSrc/64    4.30GB/s ± 0%   5.72GB/s ± 0%  +32.90%  (p=0.003 n=5+7)
MemmoveUnalignedSrc/128   6.57GB/s ± 0%   8.42GB/s ± 0%  +28.06%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/256   9.39GB/s ± 1%  13.33GB/s ± 0%  +41.96%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/512   12.7GB/s ± 0%   18.8GB/s ± 0%  +48.53%  (p=0.003 n=5+7)
MemmoveUnalignedSrc/1024  13.6GB/s ± 0%   23.0GB/s ± 0%  +69.82%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/2048  15.6GB/s ± 0%   26.8GB/s ± 3%  +71.37%  (p=0.002 n=5+8)
MemmoveUnalignedSrc/4096  16.5GB/s ± 0%   28.2GB/s ± 0%  +71.40%  (p=0.002 n=5+8)

Fixes #22925

Change-Id: I38c1a9ad5c6e3f4f95fc521c4b7e3140b58b4737
Reviewed-on: https://go-review.googlesource.com/83799
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-03-01 20:34:11 +00:00
Austin Clements beeabbcb25 runtime: use NOFRAME on arm64
This replaces frame size -8 with the NOFRAME flag in arm64 assembly.

This was automated with:

sed -i -e 's/\(^TEXT.*[A-Z]\),\( *\)\$-8/\1|NOFRAME,\2$0/' $(find -name '*_arm64.s')

Plus a manual fix to mkduff.go.

The go binary is identical before and after this change.

Change-Id: I0310384d1a584118c41d1cd3a042bb8ea7227efa
Reviewed-on: https://go-review.googlesource.com/92043
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-02-12 21:41:31 +00:00
Michael Hudson-Doyle 168a51b3a1 runtime: adjust the arm64 memmove and memclr to operate by word as much as they can
Not only is this an obvious optimization:

benchmark                           old MB/s     new MB/s     speedup
BenchmarkMemmove1-4                 35.35        29.65        0.84x
BenchmarkMemmove2-4                 63.78        52.53        0.82x
BenchmarkMemmove3-4                 89.72        73.96        0.82x
BenchmarkMemmove4-4                 109.94       95.73        0.87x
BenchmarkMemmove5-4                 127.60       112.80       0.88x
BenchmarkMemmove6-4                 143.59       126.67       0.88x
BenchmarkMemmove7-4                 157.90       138.92       0.88x
BenchmarkMemmove8-4                 167.18       231.81       1.39x
BenchmarkMemmove9-4                 175.23       252.07       1.44x
BenchmarkMemmove10-4                165.68       261.10       1.58x
BenchmarkMemmove11-4                174.43       263.31       1.51x
BenchmarkMemmove12-4                180.76       267.56       1.48x
BenchmarkMemmove13-4                189.06       284.93       1.51x
BenchmarkMemmove14-4                186.31       284.72       1.53x
BenchmarkMemmove15-4                195.75       281.62       1.44x
BenchmarkMemmove16-4                202.96       439.23       2.16x
BenchmarkMemmove32-4                264.77       775.77       2.93x
BenchmarkMemmove64-4                306.81       1209.64      3.94x
BenchmarkMemmove128-4               357.03       1515.41      4.24x
BenchmarkMemmove256-4               380.77       2066.01      5.43x
BenchmarkMemmove512-4               385.05       2556.45      6.64x
BenchmarkMemmove1024-4              381.23       2804.10      7.36x
BenchmarkMemmove2048-4              379.06       2814.83      7.43x
BenchmarkMemmove4096-4              387.43       3064.96      7.91x
BenchmarkMemmoveUnaligned1-4        28.91        25.40        0.88x
BenchmarkMemmoveUnaligned2-4        56.13        47.56        0.85x
BenchmarkMemmoveUnaligned3-4        74.32        69.31        0.93x
BenchmarkMemmoveUnaligned4-4        97.02        83.58        0.86x
BenchmarkMemmoveUnaligned5-4        110.17       103.62       0.94x
BenchmarkMemmoveUnaligned6-4        124.95       113.26       0.91x
BenchmarkMemmoveUnaligned7-4        142.37       130.82       0.92x
BenchmarkMemmoveUnaligned8-4        151.20       205.64       1.36x
BenchmarkMemmoveUnaligned9-4        166.97       215.42       1.29x
BenchmarkMemmoveUnaligned10-4       148.49       221.22       1.49x
BenchmarkMemmoveUnaligned11-4       159.47       239.57       1.50x
BenchmarkMemmoveUnaligned12-4       163.52       247.32       1.51x
BenchmarkMemmoveUnaligned13-4       167.55       256.54       1.53x
BenchmarkMemmoveUnaligned14-4       175.12       251.03       1.43x
BenchmarkMemmoveUnaligned15-4       192.10       267.13       1.39x
BenchmarkMemmoveUnaligned16-4       190.76       378.87       1.99x
BenchmarkMemmoveUnaligned32-4       259.02       562.98       2.17x
BenchmarkMemmoveUnaligned64-4       317.72       842.44       2.65x
BenchmarkMemmoveUnaligned128-4      355.43       1274.49      3.59x
BenchmarkMemmoveUnaligned256-4      378.17       1815.74      4.80x
BenchmarkMemmoveUnaligned512-4      362.15       2180.81      6.02x
BenchmarkMemmoveUnaligned1024-4     376.07       2453.58      6.52x
BenchmarkMemmoveUnaligned2048-4     381.66       2568.32      6.73x
BenchmarkMemmoveUnaligned4096-4     398.51       2669.36      6.70x
BenchmarkMemclr5-4                  113.83       107.93       0.95x
BenchmarkMemclr16-4                 223.84       389.63       1.74x
BenchmarkMemclr64-4                 421.99       1209.58      2.87x
BenchmarkMemclr256-4                525.94       2411.58      4.59x
BenchmarkMemclr4096-4               581.66       4372.20      7.52x
BenchmarkMemclr65536-4              565.84       4747.48      8.39x
BenchmarkGoMemclr5-4                194.63       160.31       0.82x
BenchmarkGoMemclr16-4               295.30       630.07       2.13x
BenchmarkGoMemclr64-4               480.24       1884.03      3.92x
BenchmarkGoMemclr256-4              540.23       2926.49      5.42x

but it turns out that it's necessary to avoid the GC seeing partially written
pointers.

It's of course possible to be more sophisticated (using ldp/stp to move 16
bytes at a time in the core loop and unrolling the tail copying loops being
the obvious ideas) but I wanted something simple and (reasonably) obviously
correct.

Fixes #12552

Change-Id: Iaeaf8a812cd06f4747ba2f792de1ded738890735
Reviewed-on: https://go-review.googlesource.com/14813
Reviewed-by: Austin Clements <austin@google.com>
2015-10-08 07:49:35 +00:00
Aram Hăvărneanu 846ee0465b runtime: add support for linux/arm64
Change-Id: Ibda6a5bedaff57fd161d63fc04ad260931d34413
Reviewed-on: https://go-review.googlesource.com/7142
Reviewed-by: Russ Cox <rsc@golang.org>
2015-03-16 18:45:54 +00:00