Commit Graph

97 Commits

Author SHA1 Message Date
Lynn Boger e23322e2cc cmd/internal/obj/ppc64: modify PCALIGN to ensure alignment
The initial purpose of PCALIGN was to identify code
where it would be beneficial to align code for performance,
but avoid cases where too many NOPs were added. On p10, it
is now necessary to enforce a certain alignment in some
cases, so the behavior of PCALIGN needs to be slightly
different.  Code will now be aligned to the value specified
on the PCALIGN instruction regardless of number of NOPs added,
which is more intuitive and consistent with power assembler
alignment directives.

This also adds 64 as a possible alignment value.

The existing values used in PCALIGN were modified according to
the new behavior.

A testcase was updated and performance testing was done to
verify that this does not adversely affect performance.

Change-Id: Iad1cf5ff112e5bfc0514f0805be90e24095e932b
Reviewed-on: https://go-review.googlesource.com/c/go/+/485056
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
2023-04-21 16:47:45 +00:00
Paul E. Murphy db32aba508 internal/bytealg: rewrite indexbytebody on PPC64
Use P8 instructions throughout to be backwards compatible, but
otherwise not impede performance. Use overlapping loads where
possible, and prioritize larger checks over smaller check.

However, some newer instructions can be used surgically when
targeting a newer GOPPC64. These can lead to noticeable
performance improvements with minimal impact to readability.

All tests run below on a Power10/ppc64le, and use a small
modification to BenchmarkIndexByte to ensure the IndexByte
wrapper call is inlined (as it likely is under realistic usage).
This wrapper adds substantial overhead if not inlined.

Previous (power9 path, GOPPC64=power8) vs. GOPPC64=power8:

IndexByte/1       3.81ns ± 8%     3.11ns ± 5%  -18.39%
IndexByte/2       3.82ns ± 3%     3.20ns ± 6%  -16.23%
IndexByte/3       3.61ns ± 4%     3.25ns ± 6%  -10.13%
IndexByte/4       3.66ns ± 5%     3.08ns ± 1%  -15.91%
IndexByte/5       3.82ns ± 0%     3.75ns ± 2%   -1.94%
IndexByte/6       3.83ns ± 0%     3.87ns ± 4%   +1.04%
IndexByte/7       3.83ns ± 0%     3.82ns ± 0%   -0.27%
IndexByte/8       3.82ns ± 0%     2.92ns ±11%  -23.70%
IndexByte/9       3.70ns ± 2%     3.08ns ± 2%  -16.87%
IndexByte/10      3.74ns ± 2%     3.04ns ± 0%  -18.75%
IndexByte/11      3.75ns ± 0%     3.31ns ± 8%  -11.79%
IndexByte/12      3.74ns ± 0%     3.04ns ± 0%  -18.86%
IndexByte/13      3.83ns ± 4%     3.04ns ± 0%  -20.64%
IndexByte/14      3.80ns ± 1%     3.30ns ± 8%  -13.18%
IndexByte/15      3.77ns ± 1%     3.04ns ± 0%  -19.33%
IndexByte/16      3.81ns ± 0%     2.78ns ± 7%  -26.88%
IndexByte/17      4.12ns ± 0%     3.04ns ± 1%  -26.11%
IndexByte/18      4.27ns ± 6%     3.05ns ± 0%  -28.64%
IndexByte/19      4.30ns ± 4%     3.02ns ± 2%  -29.65%
IndexByte/20      4.43ns ± 7%     3.45ns ± 7%  -22.15%
IndexByte/21      4.12ns ± 0%     3.03ns ± 1%  -26.35%
IndexByte/22      4.40ns ± 6%     3.05ns ± 0%  -30.82%
IndexByte/23      4.40ns ± 6%     3.01ns ± 2%  -31.48%
IndexByte/24      4.32ns ± 5%     3.07ns ± 0%  -28.98%
IndexByte/25      4.76ns ± 2%     3.04ns ± 1%  -36.11%
IndexByte/26      4.82ns ± 0%     3.05ns ± 0%  -36.66%
IndexByte/27      4.82ns ± 0%     2.97ns ± 3%  -38.39%
IndexByte/28      4.82ns ± 0%     2.96ns ± 3%  -38.57%
IndexByte/29      4.82ns ± 0%     3.34ns ± 9%  -30.71%
IndexByte/30      4.82ns ± 0%     3.05ns ± 0%  -36.77%
IndexByte/31      4.81ns ± 0%     3.05ns ± 0%  -36.70%
IndexByte/32      3.52ns ± 0%     3.44ns ± 1%   -2.15%
IndexByte/33      4.77ns ± 1%     3.35ns ± 0%  -29.81%
IndexByte/34      5.01ns ± 5%     3.35ns ± 0%  -33.15%
IndexByte/35      4.92ns ± 9%     3.35ns ± 0%  -31.89%
IndexByte/36      4.81ns ± 5%     3.35ns ± 0%  -30.37%
IndexByte/37      4.99ns ± 6%     3.35ns ± 0%  -32.86%
IndexByte/38      5.06ns ± 5%     3.35ns ± 0%  -33.84%
IndexByte/39      5.02ns ± 5%     3.48ns ± 9%  -30.58%
IndexByte/40      5.21ns ± 9%     3.55ns ± 4%  -31.82%
IndexByte/41      5.18ns ± 0%     3.42ns ± 2%  -33.98%
IndexByte/42      5.19ns ± 0%     3.55ns ±11%  -31.56%
IndexByte/43      5.18ns ± 0%     3.45ns ± 5%  -33.46%
IndexByte/44      5.18ns ± 0%     3.39ns ± 0%  -34.56%
IndexByte/45      5.18ns ± 0%     3.43ns ± 4%  -33.74%
IndexByte/46      5.18ns ± 0%     3.47ns ± 1%  -33.03%
IndexByte/47      5.18ns ± 0%     3.44ns ± 2%  -33.54%
IndexByte/48      5.18ns ± 0%     3.39ns ± 0%  -34.52%
IndexByte/49      5.69ns ± 0%     3.79ns ± 0%  -33.45%
IndexByte/50      5.70ns ± 0%     3.70ns ± 3%  -34.98%
IndexByte/51      5.70ns ± 0%     3.70ns ± 2%  -35.05%
IndexByte/52      5.69ns ± 0%     3.80ns ± 1%  -33.35%
IndexByte/53      5.69ns ± 0%     3.78ns ± 0%  -33.54%
IndexByte/54      5.69ns ± 0%     3.78ns ± 1%  -33.51%
IndexByte/55      5.69ns ± 0%     3.78ns ± 0%  -33.61%
IndexByte/56      5.69ns ± 0%     3.81ns ± 3%  -33.12%
IndexByte/57      6.20ns ± 0%     3.79ns ± 4%  -38.89%
IndexByte/58      6.20ns ± 0%     3.74ns ± 2%  -39.58%
IndexByte/59      6.20ns ± 0%     3.69ns ± 2%  -40.47%
IndexByte/60      6.20ns ± 0%     3.79ns ± 1%  -38.81%
IndexByte/61      6.20ns ± 0%     3.77ns ± 1%  -39.23%
IndexByte/62      6.20ns ± 0%     3.79ns ± 0%  -38.89%
IndexByte/63      6.20ns ± 0%     3.79ns ± 0%  -38.90%
IndexByte/64      4.17ns ± 0%     3.47ns ± 3%  -16.70%
IndexByte/65      5.38ns ± 0%     4.21ns ± 0%  -21.59%
IndexByte/66      5.38ns ± 0%     4.21ns ± 0%  -21.58%
IndexByte/67      5.38ns ± 0%     4.22ns ± 0%  -21.58%
IndexByte/68      5.38ns ± 0%     4.22ns ± 0%  -21.59%
IndexByte/69      5.38ns ± 0%     4.22ns ± 0%  -21.56%
IndexByte/70      5.38ns ± 0%     4.21ns ± 0%  -21.59%
IndexByte/71      5.37ns ± 0%     4.21ns ± 0%  -21.51%
IndexByte/72      5.37ns ± 0%     4.22ns ± 0%  -21.46%
IndexByte/73      5.71ns ± 0%     4.22ns ± 0%  -26.20%
IndexByte/74      5.71ns ± 0%     4.21ns ± 0%  -26.21%
IndexByte/75      5.71ns ± 0%     4.21ns ± 0%  -26.17%
IndexByte/76      5.71ns ± 0%     4.22ns ± 0%  -26.22%
IndexByte/77      5.71ns ± 0%     4.22ns ± 0%  -26.22%
IndexByte/78      5.71ns ± 0%     4.21ns ± 0%  -26.22%
IndexByte/79      5.71ns ± 0%     4.22ns ± 0%  -26.21%
IndexByte/80      5.71ns ± 0%     4.21ns ± 0%  -26.19%
IndexByte/81      6.20ns ± 0%     4.39ns ± 0%  -29.13%
IndexByte/82      6.20ns ± 0%     4.36ns ± 0%  -29.67%
IndexByte/83      6.20ns ± 0%     4.36ns ± 0%  -29.63%
IndexByte/84      6.20ns ± 0%     4.39ns ± 0%  -29.21%
IndexByte/85      6.20ns ± 0%     4.36ns ± 0%  -29.64%
IndexByte/86      6.20ns ± 0%     4.36ns ± 0%  -29.63%
IndexByte/87      6.20ns ± 0%     4.39ns ± 0%  -29.21%
IndexByte/88      6.20ns ± 0%     4.36ns ± 0%  -29.65%
IndexByte/89      6.74ns ± 0%     4.36ns ± 0%  -35.33%
IndexByte/90      6.75ns ± 0%     4.37ns ± 0%  -35.22%
IndexByte/91      6.74ns ± 0%     4.36ns ± 0%  -35.30%
IndexByte/92      6.74ns ± 0%     4.36ns ± 0%  -35.34%
IndexByte/93      6.74ns ± 0%     4.37ns ± 0%  -35.20%
IndexByte/94      6.74ns ± 0%     4.36ns ± 0%  -35.33%
IndexByte/95      6.75ns ± 0%     4.36ns ± 0%  -35.32%
IndexByte/96      4.83ns ± 0%     4.34ns ± 2%  -10.24%
IndexByte/97      5.91ns ± 0%     4.65ns ± 0%  -21.24%
IndexByte/98      5.91ns ± 0%     4.65ns ± 0%  -21.24%
IndexByte/99      5.91ns ± 0%     4.65ns ± 0%  -21.23%
IndexByte/100     5.90ns ± 0%     4.65ns ± 0%  -21.21%
IndexByte/101     5.90ns ± 0%     4.65ns ± 0%  -21.22%
IndexByte/102     5.90ns ± 0%     4.65ns ± 0%  -21.23%
IndexByte/103     5.91ns ± 0%     4.65ns ± 0%  -21.23%
IndexByte/104     5.91ns ± 0%     4.65ns ± 0%  -21.24%
IndexByte/105     6.25ns ± 0%     4.65ns ± 0%  -25.59%
IndexByte/106     6.25ns ± 0%     4.65ns ± 0%  -25.59%
IndexByte/107     6.25ns ± 0%     4.65ns ± 0%  -25.60%
IndexByte/108     6.25ns ± 0%     4.65ns ± 0%  -25.58%
IndexByte/109     6.24ns ± 0%     4.65ns ± 0%  -25.50%
IndexByte/110     6.25ns ± 0%     4.65ns ± 0%  -25.56%
IndexByte/111     6.25ns ± 0%     4.65ns ± 0%  -25.60%
IndexByte/112     6.25ns ± 0%     4.65ns ± 0%  -25.59%
IndexByte/113     6.76ns ± 0%     5.05ns ± 0%  -25.37%
IndexByte/114     6.76ns ± 0%     5.05ns ± 0%  -25.31%
IndexByte/115     6.76ns ± 0%     5.05ns ± 0%  -25.38%
IndexByte/116     6.76ns ± 0%     5.05ns ± 0%  -25.31%
IndexByte/117     6.76ns ± 0%     5.05ns ± 0%  -25.38%
IndexByte/118     6.76ns ± 0%     5.05ns ± 0%  -25.31%
IndexByte/119     6.76ns ± 0%     5.05ns ± 0%  -25.38%
IndexByte/120     6.76ns ± 0%     5.05ns ± 0%  -25.36%
IndexByte/121     7.35ns ± 0%     5.05ns ± 0%  -31.33%
IndexByte/122     7.36ns ± 0%     5.05ns ± 0%  -31.42%
IndexByte/123     7.38ns ± 0%     5.05ns ± 0%  -31.60%
IndexByte/124     7.38ns ± 0%     5.05ns ± 0%  -31.59%
IndexByte/125     7.38ns ± 0%     5.05ns ± 0%  -31.60%
IndexByte/126     7.38ns ± 0%     5.05ns ± 0%  -31.58%
IndexByte/128     5.28ns ± 0%     5.10ns ± 0%   -3.41%
IndexByte/256     7.27ns ± 0%     7.28ns ± 2%   +0.13%
IndexByte/512     12.1ns ± 0%     11.8ns ± 0%   -2.51%
IndexByte/1K      23.1ns ± 3%     22.0ns ± 0%   -4.66%
IndexByte/2K      42.6ns ± 0%     42.4ns ± 0%   -0.41%
IndexByte/4K      90.3ns ± 0%     89.4ns ± 0%   -0.98%
IndexByte/8K       170ns ± 0%      170ns ± 0%   -0.59%
IndexByte/16K      331ns ± 0%      330ns ± 0%   -0.27%
IndexByte/32K      660ns ± 0%      660ns ± 0%   -0.08%
IndexByte/64K     1.30µs ± 0%     1.30µs ± 0%   -0.08%
IndexByte/128K    2.58µs ± 0%     2.58µs ± 0%   -0.04%
IndexByte/256K    5.15µs ± 0%     5.15µs ± 0%   -0.04%
IndexByte/512K    10.3µs ± 0%     10.3µs ± 0%   -0.03%
IndexByte/1M      20.6µs ± 0%     20.5µs ± 0%   -0.03%
IndexByte/2M      41.1µs ± 0%     41.1µs ± 0%   -0.03%
IndexByte/4M      82.2µs ± 0%     82.1µs ± 0%   -0.02%
IndexByte/8M       164µs ± 0%      164µs ± 0%   -0.01%
IndexByte/16M      328µs ± 0%      328µs ± 0%   -0.01%
IndexByte/32M      657µs ± 0%      657µs ± 0%   -0.00%

GOPPC64=power8 vs GOPPC64=power9. The Improvement is
most noticed between 16 and 64B, and goes away around
128B.

IndexByte/16      2.78ns ± 7%     2.65ns ±15%   -4.74%
IndexByte/17      3.04ns ± 1%     2.80ns ± 3%   -7.85%
IndexByte/18      3.05ns ± 0%     2.71ns ± 4%  -11.00%
IndexByte/19      3.02ns ± 2%     2.76ns ±10%   -8.74%
IndexByte/20      3.45ns ± 7%     2.91ns ± 0%  -15.46%
IndexByte/21      3.03ns ± 1%     2.84ns ± 9%   -6.33%
IndexByte/22      3.05ns ± 0%     2.67ns ± 1%  -12.38%
IndexByte/23      3.01ns ± 2%     2.67ns ± 1%  -11.24%
IndexByte/24      3.07ns ± 0%     2.92ns ±12%   -4.79%
IndexByte/25      3.04ns ± 1%     3.15ns ±15%   +3.63%
IndexByte/26      3.05ns ± 0%     2.83ns ±13%   -7.33%
IndexByte/27      2.97ns ± 3%     2.98ns ±10%   +0.56%
IndexByte/28      2.96ns ± 3%     2.96ns ± 9%   -0.05%
IndexByte/29      3.34ns ± 9%     3.03ns ±12%   -9.33%
IndexByte/30      3.05ns ± 0%     2.68ns ± 1%  -12.05%
IndexByte/31      3.05ns ± 0%     2.83ns ±12%   -7.27%
IndexByte/32      3.44ns ± 1%     3.21ns ±10%   -6.78%
IndexByte/33      3.35ns ± 0%     3.41ns ± 2%   +1.95%
IndexByte/34      3.35ns ± 0%     3.13ns ± 0%   -6.53%
IndexByte/35      3.35ns ± 0%     3.13ns ± 0%   -6.54%
IndexByte/36      3.35ns ± 0%     3.13ns ± 0%   -6.52%
IndexByte/37      3.35ns ± 0%     3.13ns ± 0%   -6.52%
IndexByte/38      3.35ns ± 0%     3.24ns ± 4%   -3.30%
IndexByte/39      3.48ns ± 9%     3.44ns ± 2%   -1.19%
IndexByte/40      3.55ns ± 4%     3.46ns ± 2%   -2.44%
IndexByte/41      3.42ns ± 2%     3.39ns ± 4%   -0.86%
IndexByte/42      3.55ns ±11%     3.46ns ± 1%   -2.65%
IndexByte/43      3.45ns ± 5%     3.44ns ± 2%   -0.31%
IndexByte/44      3.39ns ± 0%     3.43ns ± 3%   +1.23%
IndexByte/45      3.43ns ± 4%     3.50ns ± 1%   +2.07%
IndexByte/46      3.47ns ± 1%     3.46ns ± 2%   -0.31%
IndexByte/47      3.44ns ± 2%     3.47ns ± 1%   +0.78%
IndexByte/48      3.39ns ± 0%     3.46ns ± 2%   +1.96%
IndexByte/49      3.79ns ± 0%     3.47ns ± 0%   -8.41%
IndexByte/50      3.70ns ± 3%     3.64ns ± 5%   -1.66%
IndexByte/51      3.70ns ± 2%     3.75ns ± 0%   +1.40%
IndexByte/52      3.80ns ± 1%     3.77ns ± 0%   -0.70%
IndexByte/53      3.78ns ± 0%     3.77ns ± 0%   -0.46%
IndexByte/54      3.78ns ± 1%     3.53ns ± 7%   -6.74%
IndexByte/55      3.78ns ± 0%     3.47ns ± 0%   -8.17%
IndexByte/56      3.81ns ± 3%     3.45ns ± 0%   -9.43%
IndexByte/57      3.79ns ± 4%     3.47ns ± 0%   -8.45%
IndexByte/58      3.74ns ± 2%     3.55ns ± 4%   -5.16%
IndexByte/59      3.69ns ± 2%     3.61ns ± 4%   -2.01%
IndexByte/60      3.79ns ± 1%     3.45ns ± 0%   -9.09%
IndexByte/61      3.77ns ± 1%     3.47ns ± 0%   -7.93%
IndexByte/62      3.79ns ± 0%     3.45ns ± 0%   -8.97%
IndexByte/63      3.79ns ± 0%     3.47ns ± 0%   -8.44%
IndexByte/64      3.47ns ± 3%     3.18ns ± 0%   -8.41%

GOPPC64=power9 vs GOPPC64=power10. Only sizes <16 will
show meaningful changes.

IndexByte/1       3.27ns ± 8%     2.36ns ± 2%  -27.58%
IndexByte/2       3.06ns ± 4%     2.34ns ± 1%  -23.42%
IndexByte/3       3.77ns ±11%     2.48ns ± 7%  -34.03%
IndexByte/4       3.18ns ± 8%     2.33ns ± 1%  -26.69%
IndexByte/5       3.18ns ± 5%     2.34ns ± 4%  -26.26%
IndexByte/6       3.13ns ± 3%     2.35ns ± 1%  -24.97%
IndexByte/7       3.25ns ± 1%     2.33ns ± 1%  -28.22%
IndexByte/8       2.79ns ± 2%     2.36ns ± 1%  -15.32%
IndexByte/9       2.90ns ± 0%     2.34ns ± 2%  -19.36%
IndexByte/10      2.99ns ± 3%     2.31ns ± 1%  -22.70%
IndexByte/11      3.13ns ± 7%     2.31ns ± 0%  -26.08%
IndexByte/12      3.01ns ± 4%     2.32ns ± 1%  -22.91%
IndexByte/13      2.98ns ± 3%     2.31ns ± 1%  -22.72%
IndexByte/14      2.92ns ± 2%     2.61ns ±16%  -10.58%
IndexByte/15      3.02ns ± 5%     2.69ns ± 7%  -10.90%
IndexByte/16      2.65ns ±15%     2.29ns ± 1%  -13.61%

Change-Id: I4482f762d25eabf60def4981a0b2bc0c10ccf50c
Reviewed-on: https://go-review.googlesource.com/c/go/+/478656
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Archana Ravindar <aravind5@in.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-04-21 16:10:29 +00:00
Archana R 56c4422770 runtime: improve index on ppc64x/power10
Rewrite index asm function to use the new power10 instruction lxvl,
stxvl or the load, store vector with length which can specify the
number of bytes to be stored in a register. This avoids the need to
create a separator mask and extra AND instructions. It also allows
us to process the tail end of the string using a lot fewer instructions
as we can load bytes of separator length directly rather than loading
16 bytes and masking out bytes that are greater than separator length
On power9 and power8 the code remains unchanged.
The performance for smaller sizes improve the most, on larger sizes
we see minimal improvement.

name          old time/op    new time/op     delta
Index/10          10.6ns ± 3%     9.8ns ± 2%   -7.20%
Index/11          11.2ns ± 4%    10.6ns ± 0%   -5.99%
Index/12          12.7ns ± 3%    11.3ns ± 0%  -11.21%
Index/13          13.5ns ± 2%    11.7ns ± 0%  -13.11%
Index/14          14.1ns ± 1%    12.0ns ± 0%  -14.43%
Index/15          14.3ns ± 2%    12.4ns ± 0%  -13.39%
Index/16          14.5ns ± 1%    12.7ns ± 0%  -12.57%
Index/17          26.7ns ± 0%    25.9ns ± 0%   -2.99%
Index/18          27.3ns ± 0%    26.4ns ± 1%   -3.35%
Index/19          35.7ns ±16%    26.1ns ± 1%  -26.87%
Index/20          29.4ns ± 0%    27.3ns ± 1%   -7.06%
Index/21          29.3ns ± 0%    26.9ns ± 1%   -8.37%
Index/22          30.0ns ± 0%    27.4ns ± 0%   -8.68%
Index/23          29.9ns ± 0%    27.7ns ± 0%   -7.15%
Index/24          31.0ns ± 0%    28.0ns ± 0%   -9.92%
Index/25          31.7ns ± 0%    28.4ns ± 0%  -10.54%
Index/26          30.6ns ± 0%    28.9ns ± 1%   -5.67%
Index/27          31.4ns ± 0%    29.3ns ± 0%   -6.71%
Index/28          32.7ns ± 0%    29.6ns ± 1%   -9.36%
Index/29          33.3ns ± 0%    30.1ns ± 1%   -9.70%
Index/30          32.4ns ± 0%    30.7ns ± 0%   -5.23%
Index/31          33.2ns ± 0%    30.6ns ± 1%   -7.83%
Index/32          34.3ns ± 0%    30.9ns ± 0%   -9.94%
Index/64          46.8ns ± 0%    44.2ns ± 0%   -5.66%
Index/128         71.2ns ± 0%    67.3ns ± 0%   -5.43%
Index/256          129ns ± 0%     127ns ± 0%   -1.67%
Index/2K           838ns ± 0%     804ns ± 0%   -4.03%
Index/4K          1.65µs ± 0%    1.58µs ± 0%   -4.25%
Index/2M           829µs ± 0%     793µs ± 0%   -4.42%
Index/4M          1.65ms ± 0%    1.59ms ± 0%   -4.19%
Index/64M         26.5ms ± 0%    25.4ms ± 0%   -4.18%
IndexHard2         412µs ± 0%     396µs ± 0%   -3.76%
IndexEasy/10      10.0ns ± 0%     9.3ns ± 1%   -7.20%
IndexEasy/11      10.8ns ± 1%    11.0ns ± 1%   +2.22%
IndexEasy/12      12.3ns ± 2%    11.5ns ± 1%   -6.37%
IndexEasy/13      13.1ns ± 0%    11.7ns ± 2%  -10.83%
IndexEasy/14      13.8ns ± 2%    11.9ns ± 1%  -13.52%
IndexEasy/15      14.0ns ± 2%    12.4ns ± 2%  -11.46%
IndexEasy/16      14.3ns ± 1%    12.5ns ± 0%  -12.40%
CountHard2         415µs ± 0%     396µs ± 0%   -4.48%

Change-Id: Id3efa5ed9c662a29f58125c7f866a09f29a59b6c
Reviewed-on: https://go-review.googlesource.com/c/go/+/478918
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2023-04-18 15:42:18 +00:00
WANG Xuerui 157999f512 internal/bytealg, runtime: align some loong64 asm loops to 16-byte boundaries
The LA464 micro-architecture is very sensitive to alignment of loops,
so the final performance of linked binaries can vary wildly due to
uncontrolled alignment of certain performance-critical loops. Now that
PCALIGN is available on loong64, let's make use of it and manually align
some assembly loops. The functions are identified based on perf records
of some easily regressed go1 benchmark cases (e.g. FmtFprintfPrefixedInt,
RegexpMatchEasy0_1K and Revcomp are particularly sensitive; even those
optimizations purely reducing dynamic instruction counts can regress
those cases by 6~12%, making the numbers almost useless).

Benchmark results on Loongson 3A5000 (which is an LA464 implementation):

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 416154  │               this CL               │
                      │   sec/op    │   sec/op     vs base                │
BinaryTree17             14.10 ± 1%    14.10 ± 1%        ~ (p=1.000 n=10)
Fannkuch11               3.672 ± 0%    3.579 ± 0%   -2.53% (p=0.000 n=10)
FmtFprintfEmpty         94.72n ± 0%   94.73n ± 0%   +0.01% (p=0.000 n=10)
FmtFprintfString        149.9n ± 0%   151.9n ± 0%   +1.33% (p=0.000 n=10)
FmtFprintfInt           154.1n ± 0%   158.3n ± 0%   +2.73% (p=0.000 n=10)
FmtFprintfIntInt        236.2n ± 0%   241.4n ± 0%   +2.20% (p=0.000 n=10)
FmtFprintfPrefixedInt   314.2n ± 0%   320.2n ± 0%   +1.91% (p=0.000 n=10)
FmtFprintfFloat         405.0n ± 0%   414.3n ± 0%   +2.30% (p=0.000 n=10)
FmtManyArgs             933.6n ± 0%   949.9n ± 0%   +1.75% (p=0.000 n=10)
GobDecode               15.51m ± 1%   15.24m ± 0%   -1.77% (p=0.000 n=10)
GobEncode               18.42m ± 4%   18.10m ± 2%        ~ (p=0.631 n=10)
Gzip                    423.6m ± 0%   429.9m ± 0%   +1.49% (p=0.000 n=10)
Gunzip                  88.75m ± 0%   88.31m ± 0%   -0.50% (p=0.000 n=10)
HTTPClientServer        85.44µ ± 0%   85.71µ ± 0%   +0.31% (p=0.035 n=10)
JSONEncode              18.65m ± 0%   19.74m ± 0%   +5.81% (p=0.000 n=10)
JSONDecode              77.75m ± 0%   78.60m ± 1%   +1.09% (p=0.000 n=10)
Mandelbrot200           7.214m ± 0%   7.208m ± 0%        ~ (p=0.481 n=10)
GoParse                 7.616m ± 2%   7.616m ± 1%        ~ (p=0.739 n=10)
RegexpMatchEasy0_32     142.9n ± 0%   133.0n ± 0%   -6.93% (p=0.000 n=10)
RegexpMatchEasy0_1K     1.535µ ± 0%   1.362µ ± 0%  -11.27% (p=0.000 n=10)
RegexpMatchEasy1_32     161.8n ± 0%   161.8n ± 0%        ~ (p=0.628 n=10)
RegexpMatchEasy1_1K     1.635µ ± 0%   1.497µ ± 0%   -8.41% (p=0.000 n=10)
RegexpMatchMedium_32    1.429µ ± 0%   1.420µ ± 0%   -0.63% (p=0.000 n=10)
RegexpMatchMedium_1K    41.86µ ± 0%   42.25µ ± 0%   +0.93% (p=0.000 n=10)
RegexpMatchHard_32      2.144µ ± 0%   2.108µ ± 0%   -1.68% (p=0.000 n=10)
RegexpMatchHard_1K      63.83µ ± 0%   62.65µ ± 0%   -1.86% (p=0.000 n=10)
Revcomp                  1.337 ± 0%    1.192 ± 0%  -10.89% (p=0.000 n=10)
Template                116.4m ± 1%   115.6m ± 2%        ~ (p=0.579 n=10)
TimeParse               421.4n ± 2%   418.1n ± 1%   -0.78% (p=0.001 n=10)
TimeFormat              515.1n ± 0%   517.9n ± 0%   +0.54% (p=0.001 n=10)
geomean                 104.5µ        103.5µ        -0.99%

                     │  CL 416154   │               this CL                │
                     │     B/s      │     B/s       vs base                │
GobDecode              47.19Mi ± 1%   48.04Mi ± 0%   +1.80% (p=0.000 n=10)
GobEncode              39.73Mi ± 4%   40.44Mi ± 2%        ~ (p=0.631 n=10)
Gzip                   43.68Mi ± 0%   43.04Mi ± 0%   -1.47% (p=0.000 n=10)
Gunzip                 208.5Mi ± 0%   209.6Mi ± 0%   +0.50% (p=0.000 n=10)
JSONEncode             99.21Mi ± 0%   93.76Mi ± 0%   -5.49% (p=0.000 n=10)
JSONDecode             23.80Mi ± 0%   23.55Mi ± 1%   -1.08% (p=0.000 n=10)
GoParse                7.253Mi ± 2%   7.253Mi ± 1%        ~ (p=0.810 n=10)
RegexpMatchEasy0_32    213.6Mi ± 0%   229.4Mi ± 0%   +7.41% (p=0.000 n=10)
RegexpMatchEasy0_1K    636.3Mi ± 0%   717.3Mi ± 0%  +12.73% (p=0.000 n=10)
RegexpMatchEasy1_32    188.6Mi ± 0%   188.6Mi ± 0%        ~ (p=0.810 n=10)
RegexpMatchEasy1_1K    597.4Mi ± 0%   652.2Mi ± 0%   +9.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.35Mi ± 0%   21.49Mi ± 0%   +0.63% (p=0.000 n=10)
RegexpMatchMedium_1K   23.33Mi ± 0%   23.11Mi ± 0%   -0.94% (p=0.000 n=10)
RegexpMatchHard_32     14.24Mi ± 0%   14.48Mi ± 0%   +1.67% (p=0.000 n=10)
RegexpMatchHard_1K     15.30Mi ± 0%   15.59Mi ± 0%   +1.93% (p=0.000 n=10)
Revcomp                181.3Mi ± 0%   203.4Mi ± 0%  +12.21% (p=0.000 n=10)
Template               15.89Mi ± 1%   16.00Mi ± 2%        ~ (p=0.542 n=10)
geomean                59.33Mi        60.72Mi        +2.33%

Change-Id: I9ac28d936e03d21c46bb19fa100018f61ace6b42
Reviewed-on: https://go-review.googlesource.com/c/go/+/479816
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: WANG Xuerui <git@xen0n.name>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2023-04-07 20:21:15 +00:00
Paul E. Murphy 5f1a0320b9 internal/bytealg: rewrite PPC64 Compare
Merge the P8 and P9 paths into one. This removes the need for
a runtime CPU check and maintaining two separate code paths.

This takes advantage of overlapping checks, and the P9 SETB
(emulated with little overhead on P8) to speed up comparisons
of small strings.

Similarly, the SETB instruction can be used on GOPPC64=power9
which provides a small speedup over using a couple ISELs. This
only accounts for a few percent on very small strings, thus
results of running P8 codegen on P9 are left out.

For the baseline on a power8 machine:

BytesCompare/1     7.76ns ± 0%  6.38ns ± 0%  -17.71%
BytesCompare/2     7.77ns ± 0%  6.36ns ± 0%  -18.12%
BytesCompare/3     7.56ns ± 0%  6.36ns ± 0%  -15.79%
BytesCompare/4     7.76ns ± 0%  5.74ns ± 0%  -25.99%
BytesCompare/5     7.48ns ± 0%  5.74ns ± 0%  -23.29%
BytesCompare/6     7.56ns ± 0%  5.74ns ± 0%  -24.06%
BytesCompare/7     7.14ns ± 0%  5.74ns ± 0%  -19.63%
BytesCompare/8     5.58ns ± 0%  5.19ns ± 0%   -7.03%
BytesCompare/9     7.85ns ± 0%  5.19ns ± 0%  -33.86%
BytesCompare/10    7.87ns ± 0%  5.19ns ± 0%  -34.06%
BytesCompare/11    7.59ns ± 0%  5.19ns ± 0%  -31.59%
BytesCompare/12    7.87ns ± 0%  5.19ns ± 0%  -34.02%
BytesCompare/13    7.55ns ± 0%  5.19ns ± 0%  -31.24%
BytesCompare/14    7.47ns ± 0%  5.19ns ± 0%  -30.53%
BytesCompare/15    7.88ns ± 0%  5.19ns ± 0%  -34.09%
BytesCompare/16    6.07ns ± 0%  5.58ns ± 0%   -8.08%
BytesCompare/17    9.05ns ± 0%  5.62ns ± 0%  -37.94%
BytesCompare/18    8.95ns ± 0%  5.62ns ± 0%  -37.24%
BytesCompare/19    8.49ns ± 0%  5.62ns ± 0%  -33.81%
BytesCompare/20    9.07ns ± 0%  5.62ns ± 0%  -38.05%
BytesCompare/21    8.69ns ± 0%  5.62ns ± 0%  -35.37%
BytesCompare/22    8.57ns ± 0%  5.62ns ± 0%  -34.43%
BytesCompare/23    8.31ns ± 0%  5.62ns ± 0%  -32.38%
BytesCompare/24    8.42ns ± 0%  5.62ns ± 0%  -33.23%
BytesCompare/25    9.70ns ± 0%  5.56ns ± 0%  -42.69%
BytesCompare/26    9.53ns ± 0%  5.56ns ± 0%  -41.66%
BytesCompare/27    9.29ns ± 0%  5.56ns ± 0%  -40.15%
BytesCompare/28    9.53ns ± 0%  5.56ns ± 0%  -41.65%
BytesCompare/29    9.37ns ± 0%  5.56ns ± 0%  -40.63%
BytesCompare/30    9.17ns ± 0%  5.56ns ± 0%  -39.36%
BytesCompare/31    9.07ns ± 0%  5.56ns ± 0%  -38.71%
BytesCompare/32    5.81ns ± 0%  5.49ns ± 0%   -5.49%
BytesCompare/33    9.36ns ± 0%  5.32ns ± 0%  -43.17%
BytesCompare/34    9.44ns ± 0%  5.32ns ± 0%  -43.68%
BytesCompare/35    8.91ns ± 0%  5.32ns ± 0%  -40.29%
BytesCompare/36    9.45ns ± 0%  5.32ns ± 0%  -43.71%
BytesCompare/37    8.94ns ± 0%  5.32ns ± 0%  -40.53%
BytesCompare/38    9.08ns ± 0%  5.32ns ± 0%  -41.44%
BytesCompare/39    8.62ns ± 0%  5.32ns ± 0%  -38.33%
BytesCompare/40    7.93ns ± 0%  5.32ns ± 0%  -32.93%
BytesCompare/41    10.1ns ± 0%   5.3ns ± 0%  -47.08%
BytesCompare/42    10.1ns ± 0%   5.3ns ± 0%  -47.43%
BytesCompare/43    9.80ns ± 0%  5.32ns ± 0%  -45.66%
BytesCompare/44    10.3ns ± 0%   5.3ns ± 0%  -48.26%
BytesCompare/45    9.88ns ± 0%  5.33ns ± 0%  -46.08%
BytesCompare/46    9.82ns ± 0%  5.32ns ± 0%  -45.81%
BytesCompare/47    9.73ns ± 0%  5.33ns ± 0%  -45.25%
BytesCompare/48    8.31ns ± 0%  5.22ns ± 0%  -37.19%
BytesCompare/49    11.2ns ± 0%   5.2ns ± 0%  -53.28%
BytesCompare/50    11.1ns ± 0%   5.2ns ± 0%  -52.86%
BytesCompare/51    10.8ns ± 0%   5.2ns ± 0%  -51.37%
BytesCompare/52    11.1ns ± 0%   5.2ns ± 0%  -52.94%
BytesCompare/53    10.8ns ± 0%   5.2ns ± 0%  -51.50%
BytesCompare/54    10.7ns ± 0%   5.2ns ± 0%  -51.09%
BytesCompare/55    10.3ns ± 0%   5.2ns ± 0%  -49.49%
BytesCompare/56    10.9ns ± 0%   5.2ns ± 0%  -51.73%
BytesCompare/57    12.2ns ± 0%   5.3ns ± 0%  -56.92%
BytesCompare/58    12.2ns ± 0%   5.3ns ± 0%  -56.81%
BytesCompare/59    11.5ns ± 0%   5.3ns ± 0%  -54.45%
BytesCompare/60    12.1ns ± 0%   5.3ns ± 0%  -56.67%
BytesCompare/61    11.7ns ± 0%   5.3ns ± 0%  -54.96%
BytesCompare/62    11.9ns ± 0%   5.3ns ± 0%  -55.76%
BytesCompare/63    11.4ns ± 0%   5.3ns ± 0%  -53.73%
BytesCompare/64    6.08ns ± 0%  5.47ns ± 0%   -9.96%
BytesCompare/65    9.87ns ± 0%  5.96ns ± 0%  -39.57%
BytesCompare/66    9.81ns ± 0%  5.96ns ± 0%  -39.25%
BytesCompare/67    9.49ns ± 0%  5.96ns ± 0%  -37.18%
BytesCompare/68    9.81ns ± 0%  5.96ns ± 0%  -39.26%
BytesCompare/69    9.44ns ± 0%  5.96ns ± 0%  -36.84%
BytesCompare/70    9.58ns ± 0%  5.96ns ± 0%  -37.75%
BytesCompare/71    9.24ns ± 0%  5.96ns ± 0%  -35.50%
BytesCompare/72    8.26ns ± 0%  5.94ns ± 0%  -28.09%
BytesCompare/73    10.6ns ± 0%   5.9ns ± 0%  -43.70%
BytesCompare/74    10.6ns ± 0%   5.9ns ± 0%  -43.87%
BytesCompare/75    10.2ns ± 0%   5.9ns ± 0%  -41.83%
BytesCompare/76    10.7ns ± 0%   5.9ns ± 0%  -44.55%
BytesCompare/77    10.3ns ± 0%   5.9ns ± 0%  -42.51%
BytesCompare/78    10.3ns ± 0%   5.9ns ± 0%  -42.29%
BytesCompare/79    10.2ns ± 0%   5.9ns ± 0%  -41.95%
BytesCompare/80    8.74ns ± 0%  5.93ns ± 0%  -32.23%
BytesCompare/81    11.7ns ± 0%   6.8ns ± 0%  -41.87%
BytesCompare/82    11.7ns ± 0%   6.8ns ± 0%  -41.54%
BytesCompare/83    11.1ns ± 0%   6.8ns ± 0%  -38.32%
BytesCompare/84    11.7ns ± 0%   6.8ns ± 0%  -41.59%
BytesCompare/85    11.2ns ± 0%   6.8ns ± 0%  -38.93%
BytesCompare/86    11.2ns ± 0%   6.8ns ± 0%  -38.87%
BytesCompare/87    10.8ns ± 0%   6.8ns ± 0%  -37.07%
BytesCompare/88    11.3ns ± 0%   6.7ns ± 0%  -40.57%
BytesCompare/89    12.6ns ± 0%   6.7ns ± 0%  -46.57%
BytesCompare/90    12.6ns ± 0%   6.7ns ± 0%  -46.44%
BytesCompare/91    11.9ns ± 0%   6.7ns ± 0%  -43.66%
BytesCompare/92    12.5ns ± 0%   6.7ns ± 0%  -46.09%
BytesCompare/93    12.2ns ± 0%   6.7ns ± 0%  -44.90%
BytesCompare/94    12.4ns ± 0%   6.7ns ± 0%  -45.62%
BytesCompare/95    11.8ns ± 0%   6.7ns ± 0%  -43.00%
BytesCompare/96    7.25ns ± 0%  6.62ns ± 0%   -8.70%
BytesCompare/97    11.1ns ± 0%   7.2ns ± 0%  -34.98%
BytesCompare/98    10.9ns ± 0%   7.2ns ± 0%  -34.03%
BytesCompare/99    10.4ns ± 0%   7.2ns ± 0%  -31.19%
BytesCompare/100   10.9ns ± 0%   7.2ns ± 0%  -33.97%
BytesCompare/101   10.4ns ± 0%   7.2ns ± 0%  -31.19%
BytesCompare/102   10.7ns ± 0%   7.2ns ± 0%  -32.72%
BytesCompare/103   10.2ns ± 0%   7.2ns ± 0%  -29.28%
BytesCompare/104   9.38ns ± 0%  7.19ns ± 0%  -23.33%
BytesCompare/105   11.7ns ± 0%   7.2ns ± 0%  -38.60%
BytesCompare/106   11.7ns ± 0%   7.2ns ± 0%  -38.28%
BytesCompare/107   11.3ns ± 0%   7.2ns ± 0%  -36.48%
BytesCompare/108   11.7ns ± 0%   7.2ns ± 0%  -38.49%
BytesCompare/109   11.4ns ± 0%   7.2ns ± 0%  -36.76%
BytesCompare/110   11.3ns ± 0%   7.2ns ± 0%  -36.37%
BytesCompare/111   11.1ns ± 0%   7.2ns ± 0%  -35.05%
BytesCompare/112   9.95ns ± 0%  7.19ns ± 0%  -27.71%
BytesCompare/113   12.7ns ± 0%   7.0ns ± 0%  -44.71%
BytesCompare/114   12.6ns ± 0%   7.0ns ± 0%  -44.23%
BytesCompare/115   12.3ns ± 0%   7.0ns ± 0%  -42.83%
BytesCompare/116   12.7ns ± 0%   7.0ns ± 0%  -44.67%
BytesCompare/117   12.2ns ± 0%   7.0ns ± 0%  -42.41%
BytesCompare/118   12.2ns ± 0%   7.0ns ± 0%  -42.50%
BytesCompare/119   11.9ns ± 0%   7.0ns ± 0%  -40.76%
BytesCompare/120   12.3ns ± 0%   7.0ns ± 0%  -43.01%
BytesCompare/121   13.7ns ± 0%   7.0ns ± 0%  -48.55%
BytesCompare/122   13.6ns ± 0%   7.0ns ± 0%  -48.06%
BytesCompare/123   12.9ns ± 0%   7.0ns ± 0%  -45.44%
BytesCompare/124   13.5ns ± 0%   7.0ns ± 0%  -47.91%
BytesCompare/125   13.0ns ± 0%   7.0ns ± 0%  -46.03%
BytesCompare/126   13.2ns ± 0%   7.0ns ± 0%  -46.72%
BytesCompare/127   12.9ns ± 0%   7.0ns ± 0%  -45.36%
BytesCompare/128   7.53ns ± 0%  6.78ns ± 0%   -9.95%
BytesCompare/256   10.1ns ± 0%   9.6ns ± 0%   -4.35%
BytesCompare/512   23.0ns ± 0%  15.3ns ± 0%  -33.30%
BytesCompare/1024  36.4ns ± 0%  32.8ns ± 0%   -9.83%
BytesCompare/2048  62.0ns ± 0%  56.0ns ± 0%   -9.77%

For GOPPC64=power9 on power9:

BytesCompare/1     5.95ns ± 0%  4.83ns ± 0%  -18.89%
BytesCompare/2     6.37ns ± 0%  4.69ns ± 0%  -26.39%
BytesCompare/3     6.87ns ± 0%  4.68ns ± 0%  -31.79%
BytesCompare/4     5.86ns ± 0%  4.63ns ± 0%  -20.98%
BytesCompare/5     5.84ns ± 0%  4.63ns ± 0%  -20.67%
BytesCompare/6     5.84ns ± 0%  4.63ns ± 0%  -20.70%
BytesCompare/7     5.82ns ± 0%  4.63ns ± 0%  -20.40%
BytesCompare/8     5.81ns ± 0%  4.64ns ± 0%  -20.23%
BytesCompare/9     5.83ns ± 0%  4.71ns ± 0%  -19.19%
BytesCompare/10    6.22ns ± 0%  4.71ns ± 0%  -24.32%
BytesCompare/11    6.94ns ± 0%  4.71ns ± 0%  -32.16%
BytesCompare/12    5.77ns ± 0%  4.71ns ± 0%  -18.34%
BytesCompare/13    5.77ns ± 0%  4.71ns ± 0%  -18.44%
BytesCompare/14    5.77ns ± 0%  4.71ns ± 0%  -18.31%
BytesCompare/15    6.31ns ± 0%  4.71ns ± 0%  -25.32%
BytesCompare/16    4.99ns ± 0%  5.03ns ± 0%   +0.72%
BytesCompare/17    5.07ns ± 0%  5.03ns ± 0%   -0.87%
BytesCompare/18    5.07ns ± 0%  5.03ns ± 0%   -0.81%
BytesCompare/19    5.07ns ± 0%  5.03ns ± 0%   -0.85%
BytesCompare/20    5.07ns ± 0%  5.03ns ± 0%   -0.73%
BytesCompare/21    5.07ns ± 0%  5.03ns ± 0%   -0.81%
BytesCompare/22    5.07ns ± 0%  5.03ns ± 0%   -0.77%
BytesCompare/23    5.07ns ± 0%  5.03ns ± 0%   -0.75%
BytesCompare/24    5.08ns ± 0%  5.07ns ± 0%   -0.12%
BytesCompare/25    5.03ns ± 0%  5.00ns ± 0%   -0.60%
BytesCompare/26    5.02ns ± 0%  5.00ns ± 0%   -0.56%
BytesCompare/27    5.03ns ± 0%  5.00ns ± 0%   -0.60%
BytesCompare/28    5.03ns ± 0%  5.00ns ± 0%   -0.72%
BytesCompare/29    5.03ns ± 0%  5.00ns ± 0%   -0.68%
BytesCompare/30    5.03ns ± 0%  5.00ns ± 0%   -0.76%
BytesCompare/31    5.03ns ± 0%  5.00ns ± 0%   -0.60%
BytesCompare/32    5.02ns ± 0%  5.05ns ± 0%   +0.56%
BytesCompare/33    6.78ns ± 0%  5.16ns ± 0%  -23.84%
BytesCompare/34    7.26ns ± 0%  5.16ns ± 0%  -28.93%
BytesCompare/35    7.78ns ± 0%  5.16ns ± 0%  -33.65%
BytesCompare/36    6.72ns ± 0%  5.16ns ± 0%  -23.24%
BytesCompare/37    7.32ns ± 0%  5.16ns ± 0%  -29.55%
BytesCompare/38    7.26ns ± 0%  5.16ns ± 0%  -28.95%
BytesCompare/39    7.99ns ± 0%  5.16ns ± 0%  -35.40%
BytesCompare/40    6.67ns ± 0%  5.11ns ± 0%  -23.41%
BytesCompare/41    7.25ns ± 0%  5.14ns ± 0%  -29.05%
BytesCompare/42    7.47ns ± 0%  5.14ns ± 0%  -31.11%
BytesCompare/43    7.97ns ± 0%  5.14ns ± 0%  -35.42%
BytesCompare/44    7.29ns ± 0%  5.14ns ± 0%  -29.38%
BytesCompare/45    8.06ns ± 0%  5.14ns ± 0%  -36.20%
BytesCompare/46    7.89ns ± 0%  5.14ns ± 0%  -34.77%
BytesCompare/47    8.59ns ± 0%  5.14ns ± 0%  -40.13%
BytesCompare/48    5.57ns ± 0%  5.12ns ± 0%   -8.18%
BytesCompare/49    6.05ns ± 0%  5.17ns ± 0%  -14.48%
BytesCompare/50    6.05ns ± 0%  5.17ns ± 0%  -14.51%
BytesCompare/51    6.06ns ± 0%  5.17ns ± 0%  -14.61%
BytesCompare/52    6.05ns ± 0%  5.17ns ± 0%  -14.54%
BytesCompare/53    6.06ns ± 0%  5.17ns ± 0%  -14.56%
BytesCompare/54    6.05ns ± 0%  5.17ns ± 0%  -14.54%
BytesCompare/55    6.05ns ± 0%  5.17ns ± 0%  -14.54%
BytesCompare/56    6.02ns ± 0%  5.11ns ± 0%  -15.13%
BytesCompare/57    6.01ns ± 0%  5.14ns ± 0%  -14.56%
BytesCompare/58    6.02ns ± 0%  5.14ns ± 0%  -14.59%
BytesCompare/59    6.02ns ± 0%  5.14ns ± 0%  -14.65%
BytesCompare/60    6.03ns ± 0%  5.14ns ± 0%  -14.71%
BytesCompare/61    6.02ns ± 0%  5.14ns ± 0%  -14.69%
BytesCompare/62    6.01ns ± 0%  5.14ns ± 0%  -14.55%
BytesCompare/63    6.02ns ± 0%  5.14ns ± 0%  -14.65%
BytesCompare/64    6.09ns ± 0%  5.15ns ± 0%  -15.34%
BytesCompare/65    7.83ns ± 0%  5.93ns ± 0%  -24.17%
BytesCompare/66    7.86ns ± 0%  5.93ns ± 0%  -24.52%
BytesCompare/67    8.56ns ± 0%  5.93ns ± 0%  -30.68%
BytesCompare/68    7.90ns ± 0%  5.93ns ± 0%  -24.88%
BytesCompare/69    8.58ns ± 0%  5.93ns ± 0%  -30.84%
BytesCompare/70    8.54ns ± 0%  5.93ns ± 0%  -30.48%
BytesCompare/71    9.18ns ± 0%  5.94ns ± 0%  -35.34%
BytesCompare/72    7.89ns ± 0%  5.86ns ± 0%  -25.76%
BytesCompare/73    8.59ns ± 0%  5.82ns ± 0%  -32.25%
BytesCompare/74    8.52ns ± 0%  5.82ns ± 0%  -31.61%
BytesCompare/75    9.17ns ± 0%  5.82ns ± 0%  -36.50%
BytesCompare/76    8.54ns ± 0%  5.82ns ± 0%  -31.85%
BytesCompare/77    9.25ns ± 0%  5.82ns ± 0%  -37.07%
BytesCompare/78    9.17ns ± 0%  5.82ns ± 0%  -36.48%
BytesCompare/79    10.0ns ± 0%   5.8ns ± 0%  -41.66%
BytesCompare/80    6.76ns ± 0%  5.69ns ± 0%  -15.90%
BytesCompare/81    7.63ns ± 0%  6.70ns ± 0%  -12.23%
BytesCompare/82    7.63ns ± 0%  6.70ns ± 0%  -12.23%
BytesCompare/83    7.63ns ± 0%  6.70ns ± 0%  -12.24%
BytesCompare/84    7.63ns ± 0%  6.70ns ± 0%  -12.24%
BytesCompare/85    7.63ns ± 0%  6.70ns ± 0%  -12.23%
BytesCompare/86    7.63ns ± 0%  6.70ns ± 0%  -12.24%
BytesCompare/87    7.63ns ± 0%  6.70ns ± 0%  -12.24%
BytesCompare/88    7.53ns ± 0%  6.56ns ± 0%  -12.90%
BytesCompare/89    7.53ns ± 0%  6.55ns ± 0%  -12.93%
BytesCompare/90    7.53ns ± 0%  6.55ns ± 0%  -12.93%
BytesCompare/91    7.53ns ± 0%  6.55ns ± 0%  -12.93%
BytesCompare/92    7.53ns ± 0%  6.55ns ± 0%  -12.93%
BytesCompare/93    7.53ns ± 0%  6.55ns ± 0%  -12.93%
BytesCompare/94    7.53ns ± 0%  6.55ns ± 0%  -12.93%
BytesCompare/95    7.53ns ± 0%  6.55ns ± 0%  -12.94%
BytesCompare/96    7.02ns ± 0%  6.45ns ± 0%   -8.09%
BytesCompare/97    8.73ns ± 0%  7.39ns ± 0%  -15.35%
BytesCompare/98    8.71ns ± 0%  7.39ns ± 0%  -15.15%
BytesCompare/99    9.42ns ± 0%  7.39ns ± 0%  -21.57%
BytesCompare/100   8.73ns ± 0%  7.39ns ± 0%  -15.36%
BytesCompare/101   9.43ns ± 0%  7.39ns ± 0%  -21.70%
BytesCompare/102   9.42ns ± 0%  7.39ns ± 0%  -21.59%
BytesCompare/103   10.2ns ± 0%   7.4ns ± 0%  -27.58%
BytesCompare/104   8.74ns ± 0%  7.35ns ± 0%  -15.95%
BytesCompare/105   9.44ns ± 0%  7.30ns ± 0%  -22.67%
BytesCompare/106   9.44ns ± 0%  7.30ns ± 0%  -22.69%
BytesCompare/107   10.2ns ± 0%   7.3ns ± 0%  -28.53%
BytesCompare/108   9.48ns ± 0%  7.30ns ± 0%  -23.04%
BytesCompare/109   10.2ns ± 0%   7.3ns ± 0%  -28.81%
BytesCompare/110   10.2ns ± 0%   7.3ns ± 0%  -28.39%
BytesCompare/111   10.9ns ± 0%   7.3ns ± 0%  -33.18%
BytesCompare/112   7.75ns ± 0%  7.16ns ± 0%   -7.60%
BytesCompare/113   8.57ns ± 0%  7.83ns ± 0%   -8.60%
BytesCompare/114   8.57ns ± 0%  7.83ns ± 0%   -8.63%
BytesCompare/115   8.57ns ± 0%  7.83ns ± 0%   -8.56%
BytesCompare/116   8.57ns ± 0%  7.83ns ± 0%   -8.57%
BytesCompare/117   8.57ns ± 0%  7.83ns ± 0%   -8.56%
BytesCompare/118   8.57ns ± 0%  7.83ns ± 0%   -8.56%
BytesCompare/119   8.57ns ± 0%  7.83ns ± 0%   -8.61%
BytesCompare/120   8.46ns ± 0%  7.71ns ± 0%   -8.80%
BytesCompare/121   8.46ns ± 0%  7.72ns ± 0%   -8.77%
BytesCompare/122   8.46ns ± 0%  7.72ns ± 0%   -8.78%
BytesCompare/123   8.46ns ± 0%  7.72ns ± 0%   -8.76%
BytesCompare/124   8.46ns ± 0%  7.72ns ± 0%   -8.70%
BytesCompare/125   8.46ns ± 0%  7.72ns ± 0%   -8.70%
BytesCompare/126   8.46ns ± 0%  7.72ns ± 0%   -8.70%
BytesCompare/127   8.46ns ± 0%  7.72ns ± 0%   -8.71%
BytesCompare/128   8.19ns ± 0%  7.35ns ± 0%  -10.29%
BytesCompare/256   12.8ns ± 0%  11.4ns ± 0%  -11.23%
BytesCompare/512   22.2ns ± 0%  20.7ns ± 0%   -6.80%
BytesCompare/1024  41.1ns ± 0%  39.8ns ± 0%   -3.12%
BytesCompare/2048  86.5ns ± 0%  81.1ns ± 0%   -6.31%

Change-Id: I7c7fb1f7b891c23c6cade580e7b9928ca1a6efc3
Reviewed-on: https://go-review.googlesource.com/c/go/+/474496
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Reviewed-by: Archana Ravindar <aravind5@in.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2023-03-21 13:10:36 +00:00
Archana R 8377f2014d runtime: improve equal on ppc64x/power10
Rewrite equal asm function to use the new power10 instruction lxvl
and stxvl- load and store with variable length which can simplify
the tail end bytes comparison process. Cleaned up code on CR
register usage.

On power9 and power8 the code remains unchanged. The performance
for multiple sizes<=16 improve on power10 with the change.

name      old time/op    new time/op    delta
Equal/1     5.28ns ± 0%    4.19ns ± 9%  -20.80%
Equal/2     5.30ns ± 0%    4.29ns ± 6%  -19.06%
Equal/3     5.10ns ± 5%    4.20ns ± 6%  -17.73%
Equal/4     5.05ns ± 0%    4.42ns ± 4%  -12.50%
Equal/5     5.27ns ± 1%    4.44ns ± 4%  -15.69%
Equal/6     5.30ns ± 0%    4.38ns ±12%  -17.44%
Equal/7     5.02ns ± 6%    4.48ns ± 2%  -10.64%
Equal/9     4.53ns ± 0%    4.34ns ± 7%   -4.21%
Equal/16    4.52ns ± 0%    4.29ns ± 6%   -5.16%

Change-Id: Ie124906e3a5012dfe634bfe09af06be42f1b178b
Reviewed-on: https://go-review.googlesource.com/c/go/+/473536
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
2023-03-16 17:29:04 +00:00
Joel Sing 71c84d4b41 internal/bytealg: remove aix and linux build tags from ppc64 index code
This code is generic to ppc64/ppc64le - there is no need to limit it to
aix or linux.

Updates #56001

Change-Id: I613964a90f9c5ca637720219a0260d65427f4be0
Reviewed-on: https://go-review.googlesource.com/c/go/+/473697
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2023-03-09 05:34:46 +00:00
Michael Pratt 79d4e894ed all: move //go: function directives directly above functions
These directives affect the next declaration, so the existing form is
valid, but can be confusing because it is easy to miss. Move then
directly above the declaration for improved readability.

CL 69120 previously moved the Gosched nosplit away to hide it from
documentation. Since CL 224737, directives are automatically excluded
from documentation.

Change-Id: I8ebf2d47fbb5e77c6f40ed8afdf79eaa4f4e335e
Reviewed-on: https://go-review.googlesource.com/c/go/+/472957
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2023-03-02 22:56:35 +00:00
Joe Tsai 132fae93b7 bytes, strings: avoid unnecessary zero initialization
Add bytealg.MakeNoZero that specially allocates a []byte
without zeroing it. It assumes the caller will populate every byte.
From within the bytes and strings packages, we can use
bytealg.MakeNoZero in a way where our logic ensures that
the entire slice is overwritten such that uninitialized bytes
are never leaked to the end user.

We use bytealg.MakeNoZero from within the following functions:

* bytes.Join
* bytes.Repeat
* bytes.ToUpper
* bytes.ToLower
* strings.Builder.Grow

The optimization in strings.Builder transitively benefits the following:

* strings.Join
* strings.Map
* strings.Repeat
* strings.ToUpper
* strings.ToLower
* strings.ToValidUTF8
* strings.Replace
* any user logic that depends on strings.Builder

This optimization is especially notable on large buffers that
do not fit in the CPU cache, such that the cost of
runtime.memclr and runtime.memmove are non-trivial since they are
both limited by the relatively slow speed of physical RAM.

Performance:

	RepeatLarge/256/1             66.0ns ± 3%    64.5ns ± 1%      ~     (p=0.095 n=5+5)
	RepeatLarge/256/16            55.4ns ± 5%    53.1ns ± 3%    -4.17%  (p=0.016 n=5+5)
	RepeatLarge/512/1             95.5ns ± 7%    87.1ns ± 2%    -8.78%  (p=0.008 n=5+5)
	RepeatLarge/512/16            84.4ns ± 9%    76.2ns ± 5%    -9.73%  (p=0.016 n=5+5)
	RepeatLarge/1024/1             161ns ± 4%     144ns ± 7%   -10.45%  (p=0.016 n=5+5)
	RepeatLarge/1024/16            148ns ± 3%     141ns ± 5%      ~     (p=0.095 n=5+5)
	RepeatLarge/2048/1             296ns ± 7%     288ns ± 5%      ~     (p=0.841 n=5+5)
	RepeatLarge/2048/16            298ns ± 8%     281ns ± 5%      ~     (p=0.151 n=5+5)
	RepeatLarge/4096/1             593ns ± 8%     539ns ± 8%    -8.99%  (p=0.032 n=5+5)
	RepeatLarge/4096/16            568ns ±12%     526ns ± 7%      ~     (p=0.056 n=5+5)
	RepeatLarge/8192/1            1.15µs ± 8%    1.08µs ±12%      ~     (p=0.095 n=5+5)
	RepeatLarge/8192/16           1.12µs ± 4%    1.07µs ± 7%      ~     (p=0.310 n=5+5)
	RepeatLarge/8192/4097         1.77ns ± 1%    1.76ns ± 2%      ~     (p=0.310 n=5+5)
	RepeatLarge/16384/1           2.06µs ± 7%    1.94µs ± 5%      ~     (p=0.222 n=5+5)
	RepeatLarge/16384/16          2.02µs ± 4%    1.92µs ± 6%      ~     (p=0.095 n=5+5)
	RepeatLarge/16384/4097        1.50µs ±15%    1.44µs ±11%      ~     (p=0.802 n=5+5)
	RepeatLarge/32768/1           3.90µs ± 8%    3.65µs ±11%      ~     (p=0.151 n=5+5)
	RepeatLarge/32768/16          3.92µs ±14%    3.68µs ±12%      ~     (p=0.222 n=5+5)
	RepeatLarge/32768/4097        3.71µs ± 5%    3.43µs ± 4%    -7.54%  (p=0.032 n=5+5)
	RepeatLarge/65536/1           7.47µs ± 8%    6.88µs ± 9%      ~     (p=0.056 n=5+5)
	RepeatLarge/65536/16          7.29µs ± 4%    6.74µs ± 6%    -7.60%  (p=0.016 n=5+5)
	RepeatLarge/65536/4097        7.90µs ±11%    6.34µs ± 5%   -19.81%  (p=0.008 n=5+5)
	RepeatLarge/131072/1          17.0µs ±18%    14.1µs ± 6%   -17.32%  (p=0.008 n=5+5)
	RepeatLarge/131072/16         15.2µs ± 2%    16.2µs ±17%      ~     (p=0.151 n=5+5)
	RepeatLarge/131072/4097       15.7µs ± 6%    14.8µs ±11%      ~     (p=0.095 n=5+5)
	RepeatLarge/262144/1          30.4µs ± 5%    31.4µs ±13%      ~     (p=0.548 n=5+5)
	RepeatLarge/262144/16         30.1µs ± 4%    30.7µs ±11%      ~     (p=1.000 n=5+5)
	RepeatLarge/262144/4097       31.2µs ± 7%    32.7µs ±13%      ~     (p=0.310 n=5+5)
	RepeatLarge/524288/1          67.5µs ± 9%    63.7µs ± 3%      ~     (p=0.095 n=5+5)
	RepeatLarge/524288/16         67.2µs ± 5%    62.9µs ± 6%      ~     (p=0.151 n=5+5)
	RepeatLarge/524288/4097       65.5µs ± 4%    65.2µs ±18%      ~     (p=0.548 n=5+5)
	RepeatLarge/1048576/1          141µs ± 6%     137µs ±14%      ~     (p=0.421 n=5+5)
	RepeatLarge/1048576/16         140µs ± 2%     134µs ±11%      ~     (p=0.222 n=5+5)
	RepeatLarge/1048576/4097       141µs ± 3%     134µs ±10%      ~     (p=0.151 n=5+5)
	RepeatLarge/2097152/1          258µs ± 2%     271µs ±10%      ~     (p=0.222 n=5+5)
	RepeatLarge/2097152/16         263µs ± 6%     273µs ± 9%      ~     (p=0.151 n=5+5)
	RepeatLarge/2097152/4097       270µs ± 2%     277µs ± 6%      ~     (p=0.690 n=5+5)
	RepeatLarge/4194304/1          684µs ± 3%     467µs ± 6%   -31.69%  (p=0.008 n=5+5)
	RepeatLarge/4194304/16         682µs ± 1%     471µs ± 7%   -30.91%  (p=0.008 n=5+5)
	RepeatLarge/4194304/4097       685µs ± 2%     465µs ±20%   -32.12%  (p=0.008 n=5+5)
	RepeatLarge/8388608/1         1.50ms ± 1%    1.16ms ± 8%   -22.63%  (p=0.008 n=5+5)
	RepeatLarge/8388608/16        1.50ms ± 2%    1.22ms ±17%   -18.49%  (p=0.008 n=5+5)
	RepeatLarge/8388608/4097      1.51ms ± 7%    1.33ms ±11%   -11.56%  (p=0.008 n=5+5)
	RepeatLarge/16777216/1        3.48ms ± 4%    2.66ms ±13%   -23.76%  (p=0.008 n=5+5)
	RepeatLarge/16777216/16       3.37ms ± 3%    2.57ms ±13%   -23.72%  (p=0.008 n=5+5)
	RepeatLarge/16777216/4097     3.38ms ± 9%    2.50ms ±11%   -26.16%  (p=0.008 n=5+5)
	RepeatLarge/33554432/1        7.74ms ± 1%    4.70ms ±19%   -39.31%  (p=0.016 n=4+5)
	RepeatLarge/33554432/16       7.90ms ± 4%    4.78ms ± 9%   -39.50%  (p=0.008 n=5+5)
	RepeatLarge/33554432/4097     7.80ms ± 2%    4.86ms ±11%   -37.60%  (p=0.008 n=5+5)
	RepeatLarge/67108864/1        16.4ms ± 3%     9.7ms ±15%   -41.29%  (p=0.008 n=5+5)
	RepeatLarge/67108864/16       16.5ms ± 1%     9.9ms ±15%   -39.83%  (p=0.008 n=5+5)
	RepeatLarge/67108864/4097     16.5ms ± 1%    11.0ms ±18%   -32.95%  (p=0.008 n=5+5)
	RepeatLarge/134217728/1       35.2ms ±12%    19.2ms ±10%   -45.58%  (p=0.008 n=5+5)
	RepeatLarge/134217728/16      34.6ms ± 6%    19.3ms ± 7%   -44.07%  (p=0.008 n=5+5)
	RepeatLarge/134217728/4097    33.2ms ± 2%    19.3ms ±14%   -41.79%  (p=0.008 n=5+5)
	RepeatLarge/268435456/1       70.9ms ± 2%    36.2ms ± 5%   -48.87%  (p=0.008 n=5+5)
	RepeatLarge/268435456/16      77.4ms ± 7%    36.1ms ± 8%   -53.33%  (p=0.008 n=5+5)
	RepeatLarge/268435456/4097    75.8ms ± 4%    37.0ms ± 4%   -51.15%  (p=0.008 n=5+5)
	RepeatLarge/536870912/1        163ms ±14%      77ms ± 9%   -52.94%  (p=0.008 n=5+5)
	RepeatLarge/536870912/16       156ms ± 4%      76ms ± 6%   -51.42%  (p=0.008 n=5+5)
	RepeatLarge/536870912/4097     151ms ± 2%      76ms ± 6%   -49.64%  (p=0.008 n=5+5)
	RepeatLarge/1073741824/1       293ms ± 5%     149ms ± 8%   -49.18%  (p=0.008 n=5+5)
	RepeatLarge/1073741824/16      308ms ± 9%     150ms ± 8%   -51.19%  (p=0.008 n=5+5)
	RepeatLarge/1073741824/4097    299ms ± 5%     151ms ± 6%   -49.51%  (p=0.008 n=5+5)

Updates #57153

Change-Id: I024553b7e676d6da6408278109ac1fa8def0a802
Reviewed-on: https://go-review.googlesource.com/c/go/+/456336
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Joseph Tsai <joetsai@digital-static.net>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
2023-02-27 19:11:00 +00:00
Joel Sing 261fe25c83 internal/bytealg: simplify and improve compare on riscv64
Remove some unnecessary loops and pull the comparison code out from the
compare/loop code. Add an unaligned 8 byte comparison, which reads 8 bytes
from each input before comparing them. This gives a reasonable gain in
performance for the large unaligned case.

Updates #50615

name                                 old time/op    new time/op    delta
CompareBytesEqual-4                     116ns _ 0%     111ns _ 0%   -4.10%  (p=0.000 n=5+5)
CompareBytesToNil-4                    34.9ns _ 0%    35.0ns _ 0%   +0.45%  (p=0.002 n=5+5)
CompareBytesEmpty-4                    29.6ns _ 1%    29.8ns _ 0%   +0.71%  (p=0.016 n=5+5)
CompareBytesIdentical-4                29.8ns _ 0%    29.9ns _ 1%   +0.50%  (p=0.036 n=5+5)
CompareBytesSameLength-4               66.1ns _ 0%    60.4ns _ 0%   -8.59%  (p=0.000 n=5+5)
CompareBytesDifferentLength-4          63.1ns _ 0%    60.5ns _ 0%   -4.20%  (p=0.000 n=5+5)
CompareBytesBigUnaligned/offset=1-4    6.84ms _ 3%    6.04ms _ 5%  -11.70%  (p=0.001 n=5+5)
CompareBytesBigUnaligned/offset=2-4    6.99ms _ 4%    5.93ms _ 6%  -15.22%  (p=0.000 n=5+5)
CompareBytesBigUnaligned/offset=3-4    6.74ms _ 1%    6.00ms _ 5%  -10.94%  (p=0.001 n=5+5)
CompareBytesBigUnaligned/offset=4-4    7.20ms _ 6%    5.97ms _ 6%  -17.05%  (p=0.000 n=5+5)
CompareBytesBigUnaligned/offset=5-4    6.75ms _ 1%    5.81ms _ 8%  -13.93%  (p=0.001 n=5+5)
CompareBytesBigUnaligned/offset=6-4    6.89ms _ 5%    5.75ms _ 2%  -16.58%  (p=0.000 n=5+4)
CompareBytesBigUnaligned/offset=7-4    6.91ms _ 6%    6.13ms _ 6%  -11.27%  (p=0.001 n=5+5)
CompareBytesBig-4                      2.75ms _ 5%    2.71ms _ 8%     ~     (p=0.651 n=5+5)
CompareBytesBigIdentical-4             29.9ns _ 1%    29.8ns _ 0%     ~     (p=0.751 n=5+5)

name                                 old speed      new speed      delta
CompareBytesBigUnaligned/offset=1-4   153MB/s _ 3%   174MB/s _ 6%  +13.40%  (p=0.003 n=5+5)
CompareBytesBigUnaligned/offset=2-4   150MB/s _ 4%   177MB/s _ 6%  +18.06%  (p=0.001 n=5+5)
CompareBytesBigUnaligned/offset=3-4   156MB/s _ 1%   175MB/s _ 5%  +12.39%  (p=0.002 n=5+5)
CompareBytesBigUnaligned/offset=4-4   146MB/s _ 6%   176MB/s _ 6%  +20.67%  (p=0.001 n=5+5)
CompareBytesBigUnaligned/offset=5-4   155MB/s _ 1%   181MB/s _ 7%  +16.35%  (p=0.002 n=5+5)
CompareBytesBigUnaligned/offset=6-4   152MB/s _ 5%   182MB/s _ 2%  +19.74%  (p=0.000 n=5+4)
CompareBytesBigUnaligned/offset=7-4   152MB/s _ 6%   171MB/s _ 6%  +12.70%  (p=0.001 n=5+5)
CompareBytesBig-4                     382MB/s _ 5%   388MB/s _ 9%     ~     (p=0.616 n=5+5)
CompareBytesBigIdentical-4           35.1TB/s _ 1%  35.1TB/s _ 0%     ~     (p=0.800 n=5+5)

Change-Id: I127edc376e62a2c529719a4ab172f481e0a81357
Reviewed-on: https://go-review.googlesource.com/c/go/+/431100
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meng Zhuo <mzh@golangcn.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joedian Reid <joedian@golang.org>
Run-TryBot: Joel Sing <joel@sing.id.au>
2023-02-11 16:16:34 +00:00
Archana R 92c7df116e internal/bytealg: add PCALIGN to indexbodyp9 function on ppc64x
Adding PCALIGN in indexbodyp9 function shows
improvements in some SimonWaldherr benchmarks and one of the index
benchmarks on both Power9 and Power10

name              old time/op  new time/op  delta
Contains          19.8ns ± 0%  15.6ns ± 0%  -21.24%
ContainsNot       21.3ns ± 0%  18.9ns ± 0%  -11.03%
ContainsBytes     19.1ns ± 0%  16.0ns ± 0%  -16.54%
Index/10     17.3ns ± 0%    16.1ns ± 0%  -7.30%
Index/32     59.6ns ± 0%    59.6ns ± 0%  +0.12%
Index/4K     3.68µs ± 0%    3.68µs ± 0%    ~
Index/4M     3.74ms ± 0%    3.74ms ± 0%  -0.00%
Index/64M    59.8ms ± 0%    59.8ms ± 0%    ~

Change-Id: I784e57e0b0f5bac143f57f3a32845219e43d47fd
Reviewed-on: https://go-review.googlesource.com/c/go/+/447595
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-11-07 19:27:43 +00:00
Archana R 6774ddfec7 internal/bytealg: fix bug in index function for ppc64le/power9
The index function was not handling certain corner cases where there
were two more bytes to be examined in the tail end of the string to
complete the comparison. Fix code to ensure that when the string has
to be shifted two more times the correct bytes are examined.
Also hoisted vsplat to V10 so that all paths use the correct value.
Some comments had incorrect register names and corrected the same.
Added the strings that were failing to strings test for verification.

Fixes #56457

Change-Id: Idba7cbc802e3d73c8f4fe89309871cc8447792f5
Reviewed-on: https://go-review.googlesource.com/c/go/+/446135
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Archana Ravindar <ravindararchana@gmail.com>
2022-10-31 12:52:07 +00:00
Wayne Zuo 5d59fa143a all: delete riscv64 non-register ABI fallback path
Change-Id: I9e997b59ffb868575b780b9660df1f5ac322b79a
Reviewed-on: https://go-review.googlesource.com/c/go/+/443556
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
2022-10-26 00:52:05 +00:00
Joel Sing fd82718e06 internal/bytealg: correct alignment checks for compare/memequal on riscv64
On riscv64 we need 8 byte alignment for 8 byte loads - the existing check
was only ensuring 4 byte alignment, which potentially results in unaligned
loads being performed. Unaligned loads incur a significant performance penality
due to the resulting kernel traps and fix ups.

Adjust BenchmarkCompareBytesBigUnaligned so that this issue would have been
more readily visible.

Updates #50615

name                                 old time/op    new time/op      delta
CompareBytesBigUnaligned/offset=1-4    6.98ms _ 5%      6.84ms _ 3%       ~     (p=0.319 n=5+5)
CompareBytesBigUnaligned/offset=2-4    6.75ms _ 1%      6.99ms _ 4%       ~     (p=0.063 n=5+5)
CompareBytesBigUnaligned/offset=3-4    6.84ms _ 1%      6.74ms _ 1%     -1.48%  (p=0.003 n=5+5)
CompareBytesBigUnaligned/offset=4-4     146ms _ 1%         7ms _ 6%    -95.08%  (p=0.000 n=5+5)
CompareBytesBigUnaligned/offset=5-4    7.05ms _ 5%      6.75ms _ 1%       ~     (p=0.079 n=5+5)
CompareBytesBigUnaligned/offset=6-4    7.11ms _ 5%      6.89ms _ 5%       ~     (p=0.177 n=5+5)
CompareBytesBigUnaligned/offset=7-4    7.14ms _ 5%      6.91ms _ 6%       ~     (p=0.165 n=5+5)

name                                 old speed      new speed        delta
CompareBytesBigUnaligned/offset=1-4   150MB/s _ 5%     153MB/s _ 3%       ~     (p=0.336 n=5+5)
CompareBytesBigUnaligned/offset=2-4   155MB/s _ 1%     150MB/s _ 4%       ~     (p=0.058 n=5+5)
CompareBytesBigUnaligned/offset=3-4   153MB/s _ 1%     156MB/s _ 1%     +1.51%  (p=0.004 n=5+5)
CompareBytesBigUnaligned/offset=4-4  7.16MB/s _ 1%  145.79MB/s _ 6%  +1936.23%  (p=0.000 n=5+5)
CompareBytesBigUnaligned/offset=5-4   149MB/s _ 5%     155MB/s _ 1%       ~     (p=0.078 n=5+5)
CompareBytesBigUnaligned/offset=6-4   148MB/s _ 5%     152MB/s _ 5%       ~     (p=0.175 n=5+5)
CompareBytesBigUnaligned/offset=7-4   147MB/s _ 5%     152MB/s _ 6%       ~     (p=0.160 n=5+5)

Change-Id: I2c859e061919db482318ce63b85b808aa973a9ba
Reviewed-on: https://go-review.googlesource.com/c/go/+/431099
Reviewed-by: Meng Zhuo <mzh@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-09-19 18:59:50 +00:00
vpachkov 330cffb869 runtime: remove dead code and unnecessary checks for amd64
Use amd64 assembly header to remove unnecessary cpu flags checks
and dead code that is guaranteed to not be executed when compiling
for specific microarchitectures.

name                  old time/op  new time/op  delta
BytesCompare/1-12     3.88ns ± 1%  3.18ns ± 1%  -18.15%  (p=0.008 n=5+5)
BytesCompare/2-12     3.89ns ± 1%  3.21ns ± 2%  -17.66%  (p=0.008 n=5+5)
BytesCompare/4-12     3.89ns ± 0%  3.17ns ± 0%  -18.62%  (p=0.008 n=5+5)
BytesCompare/8-12     3.44ns ± 2%  3.39ns ± 1%   -1.36%  (p=0.008 n=5+5)
BytesCompare/16-12    3.40ns ± 1%  3.14ns ± 0%   -7.77%  (p=0.008 n=5+5)
BytesCompare/32-12    3.90ns ± 1%  3.65ns ± 0%   -6.19%  (p=0.008 n=5+5)
BytesCompare/64-12    4.96ns ± 1%  4.71ns ± 2%   -4.98%  (p=0.008 n=5+5)
BytesCompare/128-12   6.42ns ± 0%  5.99ns ± 4%   -6.75%  (p=0.008 n=5+5)
BytesCompare/256-12   9.36ns ± 0%  7.40ns ± 0%  -20.97%  (p=0.008 n=5+5)
BytesCompare/512-12   15.9ns ± 1%  11.4ns ± 1%  -28.36%  (p=0.008 n=5+5)
BytesCompare/1024-12  27.0ns ± 0%  19.3ns ± 0%  -28.36%  (p=0.008 n=5+5)
BytesCompare/2048-12  50.2ns ± 0%  43.3ns ± 0%  -13.71%  (p=0.008 n=5+5)
[Geo mean]            7.13ns       6.07ns       -14.86%

name                old speed      new speed      delta
Count/10-12          723MB/s ± 0%   704MB/s ± 1%  -2.73%  (p=0.008 n=5+5)
Count/32-12         2.21GB/s ± 0%  2.12GB/s ± 2%  -4.21%  (p=0.008 n=5+5)
Count/4K-12         1.03GB/s ± 0%  1.03GB/s ± 1%    ~     (p=1.000 n=5+5)
Count/4M-12         1.04GB/s ± 0%  1.02GB/s ± 2%    ~     (p=0.310 n=5+5)
Count/64M-12        1.02GB/s ± 0%  1.01GB/s ± 1%  -1.00%  (p=0.016 n=5+5)
CountEasy/10-12      779MB/s ± 0%   768MB/s ± 1%  -1.48%  (p=0.008 n=5+5)
CountEasy/32-12     2.15GB/s ± 0%  2.09GB/s ± 1%  -2.71%  (p=0.008 n=5+5)
CountEasy/4K-12     45.1GB/s ± 1%  45.2GB/s ± 1%    ~     (p=0.421 n=5+5)
CountEasy/4M-12     36.4GB/s ± 1%  36.5GB/s ± 1%    ~     (p=0.690 n=5+5)
CountEasy/64M-12    16.1GB/s ± 2%  16.4GB/s ± 1%    ~     (p=0.056 n=5+5)
CountSingle/10-12   2.15GB/s ± 2%  2.22GB/s ± 1%  +3.37%  (p=0.008 n=5+5)
CountSingle/32-12   5.86GB/s ± 1%  5.76GB/s ± 1%  -1.55%  (p=0.008 n=5+5)
CountSingle/4K-12   54.6GB/s ± 1%  55.0GB/s ± 1%    ~     (p=0.548 n=5+5)
CountSingle/4M-12   45.9GB/s ± 4%  46.4GB/s ± 2%    ~     (p=0.548 n=5+5)
CountSingle/64M-12  17.3GB/s ± 1%  17.2GB/s ± 2%    ~     (p=1.000 n=5+5)
[Geo mean]          5.11GB/s       5.08GB/s       -0.53%

name          old speed      new speed      delta
Equal/1-12     200MB/s ± 0%   188MB/s ± 1%   -6.11%  (p=0.008 n=5+5)
Equal/6-12    1.20GB/s ± 0%  1.13GB/s ± 1%   -6.38%  (p=0.008 n=5+5)
Equal/9-12    1.67GB/s ± 3%  1.74GB/s ± 1%   +3.83%  (p=0.008 n=5+5)
Equal/15-12   2.82GB/s ± 1%  2.89GB/s ± 1%   +2.63%  (p=0.008 n=5+5)
Equal/16-12   2.96GB/s ± 1%  3.08GB/s ± 1%   +3.95%  (p=0.008 n=5+5)
Equal/20-12   3.33GB/s ± 1%  3.54GB/s ± 1%   +6.36%  (p=0.008 n=5+5)
Equal/32-12   4.57GB/s ± 0%  5.26GB/s ± 1%  +15.09%  (p=0.008 n=5+5)
Equal/4K-12   62.0GB/s ± 1%  65.9GB/s ± 2%   +6.29%  (p=0.008 n=5+5)
Equal/4M-12   23.6GB/s ± 2%  24.8GB/s ± 4%   +5.43%  (p=0.008 n=5+5)
Equal/64M-12  11.1GB/s ± 2%  11.3GB/s ± 1%   +1.69%  (p=0.008 n=5+5)
[Geo mean]    3.91GB/s       4.03GB/s        +3.11%

name                              old speed      new speed      delta
IndexByte/10-12                   2.64GB/s ± 0%  2.69GB/s ± 0%   +1.67%  (p=0.008 n=5+5)
IndexByte/32-12                   6.79GB/s ± 0%  6.27GB/s ± 0%   -7.57%  (p=0.008 n=5+5)
IndexByte/4K-12                   56.2GB/s ± 0%  56.9GB/s ± 0%   +1.27%  (p=0.008 n=5+5)
IndexByte/4M-12                   40.1GB/s ± 1%  41.7GB/s ± 1%   +4.05%  (p=0.008 n=5+5)
IndexByte/64M-12                  17.5GB/s ± 0%  17.7GB/s ± 1%     ~     (p=0.095 n=5+5)
IndexBytePortable/10-12           2.06GB/s ± 1%  2.16GB/s ± 1%   +5.08%  (p=0.008 n=5+5)
IndexBytePortable/32-12           1.40GB/s ± 1%  1.54GB/s ± 1%  +10.05%  (p=0.008 n=5+5)
IndexBytePortable/4K-12           3.99GB/s ± 0%  4.08GB/s ± 0%   +2.16%  (p=0.008 n=5+5)
IndexBytePortable/4M-12           4.05GB/s ± 1%  4.08GB/s ± 2%     ~     (p=0.095 n=5+5)
IndexBytePortable/64M-12          3.80GB/s ± 1%  3.81GB/s ± 0%     ~     (p=0.421 n=5+5)
IndexRune/10-12                    746MB/s ± 1%   752MB/s ± 0%   +0.85%  (p=0.008 n=5+5)
IndexRune/32-12                   2.33GB/s ± 0%  2.42GB/s ± 0%   +3.66%  (p=0.008 n=5+5)
IndexRune/4K-12                   44.4GB/s ± 0%  44.2GB/s ± 0%     ~     (p=0.095 n=5+5)
IndexRune/4M-12                   36.2GB/s ± 1%  36.3GB/s ± 2%     ~     (p=0.841 n=5+5)
IndexRune/64M-12                  16.2GB/s ± 2%  16.3GB/s ± 2%     ~     (p=0.548 n=5+5)
IndexRuneASCII/10-12              2.57GB/s ± 0%  2.58GB/s ± 0%   +0.63%  (p=0.008 n=5+5)
IndexRuneASCII/32-12              6.00GB/s ± 0%  6.30GB/s ± 1%   +4.98%  (p=0.008 n=5+5)
IndexRuneASCII/4K-12              56.7GB/s ± 0%  56.8GB/s ± 1%     ~     (p=0.151 n=5+5)
IndexRuneASCII/4M-12              41.6GB/s ± 1%  41.7GB/s ± 2%     ~     (p=0.151 n=5+5)
IndexRuneASCII/64M-12             17.7GB/s ± 1%  17.6GB/s ± 1%     ~     (p=0.222 n=5+5)
Index/10-12                       1.06GB/s ± 1%  1.06GB/s ± 0%     ~     (p=0.310 n=5+5)
Index/32-12                       3.57GB/s ± 0%  3.56GB/s ± 1%     ~     (p=0.056 n=5+5)
Index/4K-12                       1.02GB/s ± 2%  1.03GB/s ± 0%     ~     (p=0.690 n=5+5)
Index/4M-12                       1.04GB/s ± 0%  1.03GB/s ± 1%     ~     (p=1.000 n=4+5)
Index/64M-12                      1.02GB/s ± 0%  1.02GB/s ± 0%     ~     (p=0.905 n=5+4)
IndexEasy/10-12                   1.12GB/s ± 2%  1.15GB/s ± 1%   +3.10%  (p=0.008 n=5+5)
IndexEasy/32-12                   3.14GB/s ± 2%  3.13GB/s ± 1%     ~     (p=0.310 n=5+5)
IndexEasy/4K-12                   47.6GB/s ± 1%  47.7GB/s ± 2%     ~     (p=0.310 n=5+5)
IndexEasy/4M-12                   36.4GB/s ± 1%  36.3GB/s ± 2%     ~     (p=0.690 n=5+5)
IndexEasy/64M-12                  16.1GB/s ± 1%  16.4GB/s ± 5%     ~     (p=0.151 n=5+5)
[Geo mean]                        6.39GB/s       6.46GB/s        +1.11%

Change-Id: Ic1ca62f5cc719d87e2c4aeff25ad73507facff82
Reviewed-on: https://go-review.googlesource.com/c/go/+/397576
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2022-08-18 17:17:01 +00:00
Xiaodong Liu d28bf6c9a2 internal/bytealg: support basic byte operation on loong64
Contributors to the loong64 port are:
  Weining Lu <luweining@loongson.cn>
  Lei Wang <wanglei@loongson.cn>
  Lingqin Gong <gonglingqin@loongson.cn>
  Xiaolin Zhao <zhaoxiaolin@loongson.cn>
  Meidan Li <limeidan@loongson.cn>
  Xiaojuan Zhai <zhaixiaojuan@loongson.cn>
  Qiyuan Pu <puqiyuan@loongson.cn>
  Guoqi Chen <chenguoqi@loongson.cn>

This port has been updated to Go 1.15.6:
  https://github.com/loongson/go

Updates #46229

Change-Id: I4ac6d38dc632abfa0b698325ca0ae349c0d7ecd3
Reviewed-on: https://go-review.googlesource.com/c/go/+/342316
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-05-17 19:55:37 +00:00
Archana R 7ca1e2aa56 internal/bytealg: optimize index function for ppc64le/power9
Optimized index2to16 loop by unrolling the loop by 4.
Multiple benchmark tests show performance improvement on
POWER9. Similar improvements are seen on POWER10. Added
tests to ensure changes work fine.

name            old time/op   new time/op    delta
Index/10         18.3ns ± 0%    19.7ns ±25%     ~
Index/32         75.3ns ± 0%    69.2ns ± 0%   -8.22%
Index/4K         5.53µs ± 0%    3.69µs ± 0%  -33.20%
Index/4M         5.64ms ± 0%    3.75ms ± 0%  -33.55%
Index/64M        92.9ms ± 0%    61.6ms ± 0%  -33.69%
IndexHard2       1.41ms ± 0%    0.93ms ± 0%  -33.75%
CountHard2       1.41ms ± 0%    0.93ms ± 0%  -33.75%

Change-Id: If9331df6a141a4716724b8cb648d2b91bdf17e5f
Reviewed-on: https://go-review.googlesource.com/c/go/+/377016
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
2022-05-09 12:02:02 +00:00
Meng Zhuo d0cda4d95f internal/bytealg: mask high bit for riscv64 regabi
This CL masks byte params which high bits(~0xff) is unused for riscv64
regabi.
Currently the compiler only guarantees the low bits contains value.

Change-Id: I6dd6c867e60d2143fefde92c866f78c4b007a2f7
Reviewed-on: https://go-review.googlesource.com/c/go/+/402894
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: mzh <mzh@golangcn.org>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
2022-05-03 14:36:37 +00:00
Paul E. Murphy 8375b54d44 internal/bytealg: improve PPC64 equal
Rewrite the vector loop to process 64B per iteration,
this greatly improves performance on POWER9/POWER10 for
large sizes.

Likewise, use a similar tricks for sizes >= 8 and <= 64.

And, rewrite small comparisons, it's a little slower for 1 byte,
but constant time for 1-7 bytes. Thus, it is increasingly faster
for 2-7B.

Benchmarks results below are from P8/P9 ppc64le (in that order),
several additional testcases have been added to test interesting
sizes. Likewise, the old variant was padded to the same code
size of the new variant to minimize layout related noise:

POWER8/ppc64le/linux:
name       old speed      new speed       delta
Equal/1     110MB/s ± 0%    106MB/s ± 0%    -3.26%
Equal/2     202MB/s ± 0%    203MB/s ± 0%    +0.18%
Equal/3     280MB/s ± 0%    319MB/s ± 0%   +13.89%
Equal/4     350MB/s ± 0%    414MB/s ± 0%   +18.27%
Equal/5     412MB/s ± 0%    533MB/s ± 0%   +29.19%
Equal/6     462MB/s ± 0%    620MB/s ± 0%   +34.11%
Equal/7     507MB/s ± 0%    745MB/s ± 0%   +47.02%
Equal/8     913MB/s ± 0%    994MB/s ± 0%    +8.84%
Equal/9     909MB/s ± 0%   1117MB/s ± 0%   +22.85%
Equal/10    937MB/s ± 0%   1242MB/s ± 0%   +32.59%
Equal/11    962MB/s ± 0%   1370MB/s ± 0%   +42.37%
Equal/12    989MB/s ± 0%   1490MB/s ± 0%   +50.60%
Equal/13   1.01GB/s ± 0%   1.61GB/s ± 0%   +60.27%
Equal/14   1.02GB/s ± 0%   1.74GB/s ± 0%   +71.22%
Equal/15   1.03GB/s ± 0%   1.86GB/s ± 0%   +81.45%
Equal/16   1.60GB/s ± 0%   1.99GB/s ± 0%   +24.21%
Equal/17   1.54GB/s ± 0%   2.04GB/s ± 0%   +32.28%
Equal/20   1.48GB/s ± 0%   2.40GB/s ± 0%   +62.64%
Equal/32   3.58GB/s ± 0%   3.84GB/s ± 0%    +7.18%
Equal/63   3.74GB/s ± 0%   7.17GB/s ± 0%   +91.79%
Equal/64   6.35GB/s ± 0%   7.29GB/s ± 0%   +14.75%
Equal/65   5.85GB/s ± 0%   7.00GB/s ± 0%   +19.66%
Equal/127  6.74GB/s ± 0%  13.74GB/s ± 0%  +103.77%
Equal/128  10.6GB/s ± 0%   12.9GB/s ± 0%   +21.98%
Equal/129  9.66GB/s ± 0%  11.96GB/s ± 0%   +23.85%
Equal/191  9.12GB/s ± 0%  17.80GB/s ± 0%   +95.26%
Equal/192  13.4GB/s ± 0%   17.2GB/s ± 0%   +28.66%
Equal/4K   29.5GB/s ± 0%   37.3GB/s ± 0%   +26.39%
Equal/4M   22.6GB/s ± 0%   23.1GB/s ± 0%    +2.40%
Equal/64M  10.6GB/s ± 0%   11.2GB/s ± 0%    +5.83%

POWER9/ppc64le/linux:
name       old speed      new speed       delta
Equal/1     122MB/s ± 0%    121MB/s ± 0%    -0.94%
Equal/2     223MB/s ± 0%    241MB/s ± 0%    +8.29%
Equal/3     289MB/s ± 0%    362MB/s ± 0%   +24.90%
Equal/4     366MB/s ± 0%    483MB/s ± 0%   +31.82%
Equal/5     427MB/s ± 0%    603MB/s ± 0%   +41.28%
Equal/6     462MB/s ± 0%    723MB/s ± 0%   +56.65%
Equal/7     509MB/s ± 0%    843MB/s ± 0%   +65.57%
Equal/8     974MB/s ± 0%   1066MB/s ± 0%    +9.46%
Equal/9    1.00GB/s ± 0%   1.20GB/s ± 0%   +19.53%
Equal/10   1.00GB/s ± 0%   1.33GB/s ± 0%   +32.81%
Equal/11   1.01GB/s ± 0%   1.47GB/s ± 0%   +45.28%
Equal/12   1.04GB/s ± 0%   1.60GB/s ± 0%   +53.46%
Equal/13   1.05GB/s ± 0%   1.73GB/s ± 0%   +64.67%
Equal/14   1.02GB/s ± 0%   1.87GB/s ± 0%   +82.93%
Equal/15   1.04GB/s ± 0%   2.00GB/s ± 0%   +92.07%
Equal/16   1.83GB/s ± 0%   2.13GB/s ± 0%   +16.58%
Equal/17   1.78GB/s ± 0%   2.18GB/s ± 0%   +22.65%
Equal/20   1.72GB/s ± 0%   2.57GB/s ± 0%   +49.24%
Equal/32   3.89GB/s ± 0%   4.10GB/s ± 0%    +5.53%
Equal/63   3.63GB/s ± 0%   7.63GB/s ± 0%  +110.45%
Equal/64   6.69GB/s ± 0%   7.75GB/s ± 0%   +15.84%
Equal/65   6.28GB/s ± 0%   7.07GB/s ± 0%   +12.46%
Equal/127  6.41GB/s ± 0%  13.65GB/s ± 0%  +112.95%
Equal/128  11.1GB/s ± 0%   14.1GB/s ± 0%   +26.56%
Equal/129  10.2GB/s ± 0%   11.2GB/s ± 0%    +9.44%
Equal/191  8.64GB/s ± 0%  16.39GB/s ± 0%   +89.75%
Equal/192  15.3GB/s ± 0%   17.8GB/s ± 0%   +16.31%
Equal/4K   24.6GB/s ± 0%   27.8GB/s ± 0%   +13.12%
Equal/4M   21.1GB/s ± 0%   22.7GB/s ± 0%    +7.66%
Equal/64M  20.8GB/s ± 0%   22.4GB/s ± 0%    +8.06%

Change-Id: Ie3c582133d526cc14e8846ef364c44c93eb7b9a2
Reviewed-on: https://go-review.googlesource.com/c/go/+/399976
Reviewed-by: Carlos Amedee <carlos@golang.org>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2022-05-02 20:18:15 +00:00
Archana R 78fb1d03d3 internal/bytealg: optimize cmpbody for ppc64le/ppc64
Vectorize the cmpbody loop for bytes of size greater than or equal
to 32 on both POWER8(LE and BE) and POWER9(LE and BE) and improve
performance of smaller size compares

Performance improves for most sizes with this change on POWER8, 9
and POWER10. For the very small sizes (upto 8) the overhead of
calling function starts to impact performance.

POWER9:
name               old time/op  new time/op  delta
BytesCompare/1     4.60ns ± 0%  5.49ns ± 0%  +19.27%
BytesCompare/2     4.68ns ± 0%  5.46ns ± 0%  +16.71%
BytesCompare/4     6.58ns ± 0%  5.49ns ± 0%  -16.58%
BytesCompare/8     4.89ns ± 0%  5.46ns ± 0%  +11.64%
BytesCompare/16    5.21ns ± 0%  4.96ns ± 0%   -4.70%
BytesCompare/32    5.09ns ± 0%  4.98ns ± 0%   -2.14%
BytesCompare/64    6.40ns ± 0%  5.96ns ± 0%   -6.84%
BytesCompare/128   11.3ns ± 0%   8.1ns ± 0%  -28.09%
BytesCompare/256   15.1ns ± 0%  12.8ns ± 0%  -15.16%
BytesCompare/512   26.5ns ± 0%  23.3ns ± 5%  -12.03%
BytesCompare/1024  50.2ns ± 0%  41.6ns ± 2%  -17.01%
BytesCompare/2048  99.3ns ± 0%  86.5ns ± 0%  -12.88%

Change-Id: I24f93b2910591e6829ddd8509aa6eeaa6355c609
Reviewed-on: https://go-review.googlesource.com/c/go/+/362797
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
2022-04-22 12:12:38 +00:00
Meng Zhuo be1d7388b3 internal/bytealg: port bytealg functions to reg ABI on riscv64
This CL adds support for the reg ABI to the bytes functions for
riscv64. These are initially under control of the
GOEXPERIMENT macro until all changes are in.

Change-Id: I026295ae38e2aba055f7a51c77f92c1921e5ec97
Reviewed-on: https://go-review.googlesource.com/c/go/+/361916
Run-TryBot: mzh <mzh@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-04-22 01:41:25 +00:00
Archana R 5c707f5f3a internal/bytealg: optimize indexbyte function for ppc64le/power9
Added specific code For POWER9 that does not need prealignment prior
to load vector. Optimized vector loop to jump out as soon as there is
a match instead of accumulating matches for 4 indices and then processing
the same. For small input size 10, the caller function dominates
performance.

name                      old time/op    new time/op    delta
IndexByte/10                9.20ns ± 0%   10.40ns ± 0%  +13.08%
IndexByte/32                9.77ns ± 0%    9.20ns ± 0%   -5.84%
IndexByte/4K                 171ns ± 0%     136ns ± 0%  -20.51%
IndexByte/4M                 154µs ± 0%     126µs ± 0%  -17.92%
IndexByte/64M               2.48ms ± 0%    2.03ms ± 0%  -18.27%
IndexAnyASCII/1:32          10.2ns ± 1%     9.2ns ± 0%   -9.19%
IndexAnyASCII/1:64          11.3ns ± 0%    10.1ns ± 0%  -11.29%
IndexAnyUTF8/1:64           11.4ns ± 0%     9.8ns ± 0%  -13.73%
IndexAnyUTF8/16:64           156ns ± 1%     131ns ± 0%  -16.23%
IndexAnyUTF8/256:64         2.27µs ± 0%    1.86µs ± 0%  -18.03%
LastIndexAnyUTF8/1:64       11.8ns ± 0%    10.5ns ± 0%  -10.81%
LastIndexAnyUTF8/16:64       165ns ±11%     132ns ± 0%  -19.75%
LastIndexAnyUTF8/256:2      1.68µs ± 0%    1.44µs ± 0%  -14.33%
LastIndexAnyUTF8/256:4      1.68µs ± 0%    1.49µs ± 0%  -11.10%
LastIndexAnyUTF8/256:8      1.68µs ± 0%    1.50µs ± 0%  -11.05%
LastIndexAnyUTF8/256:64     2.30µs ± 0%    1.90µs ± 0%  -17.56%
Change-Id: I3d2550bdfdea38fece2da9960bbe62fe6cb1840c
Reviewed-on: https://go-review.googlesource.com/c/go/+/397614
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2022-04-15 16:05:09 +00:00
Cherry Mui 0a69c98214 all: delete PPC64 non-register ABI fallback path
Change-Id: Ie058c0549167b256ad943a0134907df3aca4a69f
Reviewed-on: https://go-review.googlesource.com/c/go/+/394215
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-03-28 18:20:56 +00:00
Cherry Mui 9f252a0462 all: delete ARM64 non-register ABI fallback path
Change-Id: I3996fb31789a1f8559348e059cf371774e548a8d
Reviewed-on: https://go-review.googlesource.com/c/go/+/393875
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2022-03-18 18:26:13 +00:00
Meng Zhuo 9faef5a654 cmd/compile,bytealg: change context register on riscv64
The register ABI will use X8-X23 (CL 356519),
this CL changes context register from X20(S4) to X26(S10) to meet the
prerequisite.

Update #40724

Change-Id: I93d51d22fe7b3ea5ceffe96dff93e3af60fbe7f6
Reviewed-on: https://go-review.googlesource.com/c/go/+/357974
Trust: mzh <mzh@golangcn.org>
Run-TryBot: mzh <mzh@golangcn.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-03-10 01:28:37 +00:00
Joel Sing 291bda80e5 internal/bytealg: optimise compare on riscv64
Implement compare using loops that process 32 bytes, 16 bytes, 4 bytes
or 1 byte depending on size and alignment. For comparisons that are less
than 32 bytes the overhead of checking and adjusting alignment usually
exceeds the overhead of reading and processing 4 bytes at a time.

Updates #50615

name                           old time/op    new time/op     delta
BytesCompare/1-4                 68.4ns _ 1%     61.0ns _ 0%   -10.78%  (p=0.001 n=3+3)
BytesCompare/2-4                 82.9ns _ 0%     71.0ns _ 1%   -14.31%  (p=0.000 n=3+3)
BytesCompare/4-4                  107ns _ 0%       70ns _ 0%   -34.96%  (p=0.000 n=3+3)
BytesCompare/8-4                  156ns _ 0%       90ns _ 0%   -42.36%  (p=0.000 n=3+3)
BytesCompare/16-4                 267ns _11%      130ns _ 0%   -51.10%  (p=0.011 n=3+3)
BytesCompare/32-4                 446ns _ 0%       74ns _ 0%   -83.31%  (p=0.000 n=3+3)
BytesCompare/64-4                 840ns _ 2%       91ns _ 0%   -89.17%  (p=0.000 n=3+3)
BytesCompare/128-4               1.60_s _ 0%     0.13_s _ 0%   -92.18%  (p=0.000 n=3+3)
BytesCompare/256-4               3.15_s _ 0%     0.19_s _ 0%   -93.91%  (p=0.000 n=3+3)
BytesCompare/512-4               6.25_s _ 0%     0.33_s _ 0%   -94.80%  (p=0.000 n=3+3)
BytesCompare/1024-4              12.5_s _ 0%      0.6_s _ 0%   -95.23%  (p=0.000 n=3+3)
BytesCompare/2048-4              24.8_s _ 0%      1.1_s _ 0%   -95.46%  (p=0.000 n=3+3)
CompareBytesEqual-4               225ns _ 0%      131ns _ 0%   -41.69%  (p=0.000 n=3+3)
CompareBytesToNil-4              45.3ns _ 7%     46.7ns _ 0%      ~     (p=0.452 n=3+3)
CompareBytesEmpty-4              41.0ns _ 1%     40.6ns _ 0%      ~     (p=0.071 n=3+3)
CompareBytesIdentical-4          48.9ns _ 0%     41.3ns _ 1%   -15.58%  (p=0.000 n=3+3)
CompareBytesSameLength-4          127ns _ 0%       77ns _ 0%   -39.48%  (p=0.000 n=3+3)
CompareBytesDifferentLength-4     136ns _12%       78ns _ 0%   -42.65%  (p=0.018 n=3+3)
CompareBytesBigUnaligned-4       14.9ms _ 1%      7.3ms _ 1%   -50.95%  (p=0.000 n=3+3)
CompareBytesBig-4                14.9ms _ 1%      2.7ms _ 8%   -82.10%  (p=0.000 n=3+3)
CompareBytesBigIdentical-4       52.5ns _ 0%     44.9ns _ 0%   -14.53%  (p=0.000 n=3+3)

name                           old speed      new speed       delta
CompareBytesBigUnaligned-4     70.5MB/s _ 1%  143.8MB/s _ 1%  +103.87%  (p=0.000 n=3+3)
CompareBytesBig-4              70.3MB/s _ 1%  393.8MB/s _ 8%  +460.43%  (p=0.003 n=3+3)
CompareBytesBigIdentical-4     20.0TB/s _ 0%   23.4TB/s _ 0%   +17.00%  (p=0.000 n=3+3)

Change-Id: Ie18712a9009d425c75e1ab49d5a673d84e73a1eb
Reviewed-on: https://go-review.googlesource.com/c/go/+/380076
Trust: Joel Sing <joel@sing.id.au>
Trust: mzh <mzh@golangcn.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-03-08 15:42:25 +00:00
Joel Sing 3c42ebf3e2 internal/bytealg: optimise memequal on riscv64
Implement memequal using loops that process 32 bytes, 16 bytes, 4 bytes
or 1 byte depending on size and alignment. For comparisons that are less
than 32 bytes the overhead of checking and adjusting alignment usually
exceeds the overhead of reading and processing 4 bytes at a time.

Updates #50615

name                 old time/op    new time/op     delta
Equal/0-4              38.3ns _ 0%     43.1ns _ 0%    +12.54%  (p=0.000 n=3+3)
Equal/1-4              77.7ns _ 0%     90.3ns _ 0%    +16.27%  (p=0.000 n=3+3)
Equal/6-4               116ns _ 0%      121ns _ 0%     +3.85%  (p=0.002 n=3+3)
Equal/9-4               137ns _ 0%      126ns _ 0%     -7.98%  (p=0.000 n=3+3)
Equal/15-4              179ns _ 0%      170ns _ 0%     -4.77%  (p=0.001 n=3+3)
Equal/16-4              186ns _ 0%      159ns _ 0%    -14.65%  (p=0.000 n=3+3)
Equal/20-4              215ns _ 0%      178ns _ 0%    -17.18%  (p=0.000 n=3+3)
Equal/32-4              298ns _ 0%      101ns _ 0%    -66.22%  (p=0.000 n=3+3)
Equal/4K-4             28.9_s _ 0%      2.2_s _ 0%    -92.56%  (p=0.000 n=3+3)
Equal/4M-4             29.6ms _ 0%      2.2ms _ 0%    -92.72%  (p=0.000 n=3+3)
Equal/64M-4             758ms _75%       35ms _ 0%       ~     (p=0.127 n=3+3)
CompareBytesEqual-4     226ns _ 0%      131ns _ 0%    -41.76%  (p=0.000 n=3+3)

name                 old speed      new speed       delta
Equal/1-4            12.9MB/s _ 0%   11.1MB/s _ 0%    -13.98%  (p=0.000 n=3+3)
Equal/6-4            51.7MB/s _ 0%   49.8MB/s _ 0%     -3.72%  (p=0.002 n=3+3)
Equal/9-4            65.7MB/s _ 0%   71.4MB/s _ 0%     +8.67%  (p=0.000 n=3+3)
Equal/15-4           83.8MB/s _ 0%   88.0MB/s _ 0%     +5.02%  (p=0.001 n=3+3)
Equal/16-4           85.9MB/s _ 0%  100.6MB/s _ 0%    +17.19%  (p=0.000 n=3+3)
Equal/20-4           93.2MB/s _ 0%  112.6MB/s _ 0%    +20.74%  (p=0.000 n=3+3)
Equal/32-4            107MB/s _ 0%    317MB/s _ 0%   +195.97%  (p=0.000 n=3+3)
Equal/4K-4            142MB/s _ 0%   1902MB/s _ 0%  +1243.76%  (p=0.000 n=3+3)
Equal/4M-4            142MB/s _ 0%   1946MB/s _ 0%  +1274.22%  (p=0.000 n=3+3)
Equal/64M-4           111MB/s _55%   1941MB/s _ 0%  +1641.21%  (p=0.000 n=3+3)

Change-Id: I9af7e82de3c4c5af8813772ed139230900c03b92
Reviewed-on: https://go-review.googlesource.com/c/go/+/380075
Trust: Joel Sing <joel@sing.id.au>
Trust: mzh <mzh@golangcn.org>
Reviewed-by: mzh <mzh@golangcn.org>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-03-08 15:39:49 +00:00
Tobias Klauser f19e400180 all: remove more leftover // +build lines
CL 344955 and CL 359476 removed almost all // +build lines, but leaving
some assembly files and generating scripts. Also, some files were added
with // +build lines after CL 359476 was merged. Remove these or rename
files where more appropriate.

For #41184

Change-Id: I7eb85a498ed9788b42a636e775f261d755504ffa
Reviewed-on: https://go-review.googlesource.com/c/go/+/361480
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
2021-11-06 10:24:44 +00:00
Russ Cox f229e7031a all: go fix -fix=buildtag std cmd (except for bootstrap deps, vendor)
When these packages are released as part of Go 1.18,
Go 1.16 will no longer be supported, so we can remove
the +build tags in these files.

Ran go fix -fix=buildtag std cmd and then reverted the bootstrapDirs
as defined in src/cmd/dist/buildtool.go, which need to continue
to build with Go 1.4 for now.

Also reverted src/vendor and src/cmd/vendor, which will need
to be updated in their own repos first.

Manual changes in runtime/pprof/mprof_test.go to adjust line numbers.

For #41184.

Change-Id: Ic0f93f7091295b6abc76ed5cd6e6746e1280861e
Reviewed-on: https://go-review.googlesource.com/c/go/+/344955
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
2021-10-28 18:17:57 +00:00
Archana R 6ec9a1da2d internal/bytealg: fix Separator length check for Index/ppc64le
Modified condition in the ASM implementation of indexbody to
determine if separator length crosses 16 bytes to BGT from BGE
to avoid incorrectly crossing a page.

Also fixed IndexString to invoke indexbodyp9 when on the POWER9
platform

Change-Id: I0602a797cc75287990eea1972e9e473744f6f5a9
Reviewed-on: https://go-review.googlesource.com/c/go/+/356849
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Trust: Keith Randall <khr@golang.org>
2021-10-21 15:45:05 +00:00
Archana R 6c3cd5d2eb internal/bytealg: port bytes.Index and bytes.Count to reg ABI on ppc64x
This change adds support for the reg ABI to the Index and Count
functions for ppc64/ppc64le.

Most Index and Count benchmarks show improvement in performance on
POWER9 with this change. Similar numbers observed on POWER8 and POWER10.

name                             old time/op    new time/op    delta
Index/32                         71.0ns ± 0%    67.9ns ± 0%   -4.42% (p=0.001 n=7+6)
IndexEasy/10                     17.5ns ± 0%    17.2ns ± 0%   -1.30% (p=0.001 n=7+7)

name             old time/op    new time/op    delta
Count/10           26.6ns ± 0%    25.0ns ± 1%   -6.02%  (p=0.001 n=7+7)
Count/32           78.6ns ± 0%    74.7ns ± 0%   -4.97%  (p=0.001 n=7+7)
Count/4K           5.03µs ± 0%    5.03µs ± 0%   -0.07%  (p=0.000 n=6+7)
CountEasy/10       26.9ns ± 0%    25.2ns ± 1%   -6.31%  (p=0.001 n=7+7)
CountSingle/32     11.8ns ± 0%     9.9ns ± 0%  -15.70%  (p=0.002 n=6+6)

Change-Id: Ibd146c04f8107291c55f9e6100b8264dfccc41ae
Reviewed-on: https://go-review.googlesource.com/c/go/+/355509
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
2021-10-19 18:22:59 +00:00
Lynn Boger b8a601756a internal/bytealg: port bytealg functions to reg ABI on ppc64x
This adds support for the reg ABI to the bytes functions for
ppc64/ppc64le. These are initially under control of the
GOEXPERIMENT macro until all changes are in.

Change-Id: Id82f31056af8caa8541e27c6735f6b815a5dbf5a
Reviewed-on: https://go-review.googlesource.com/c/go/+/351190
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-09-28 20:54:26 +00:00
Martin Möhrmann 8157960d7f all: replace runtime SSE2 detection with GO386 setting
When GO386=sse2 we can assume sse2 to be present without
a runtime check. If GO386=softfloat is set we can avoid
the usage of SSE2 even if detected.

This might cause a memcpy, memclr and bytealg slowdown of Go
binaries compiled with softfloat on machines that support
SSE2. Such setups are rare and should use GO386=sse2 instead
if performance matters.

On targets that support SSE2 we avoid the runtime overhead of
dynamic cpu feature dispatch.

The removal of runtime sse2 checks also allows to simplify
internal/cpu further by removing handling of the required
feature option as a followup after this CL.

Change-Id: I90a853a8853a405cb665497c6d1a86556947ba17
Reviewed-on: https://go-review.googlesource.com/c/go/+/344350
Trust: Martin Möhrmann <martin@golang.org>
Run-TryBot: Martin Möhrmann <martin@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2021-08-23 21:22:58 +00:00
Cherry Mui d7d4f28a06 [dev.typeparams] runtime, internal/bytealg: remove regabi fallback code on AMD64
As we commit to always enabling register ABI on AMD64, remove the
fallback code.

Change-Id: I30556858ba4bac367495fa94f6a8682ecd771196
Reviewed-on: https://go-review.googlesource.com/c/go/+/341152
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Austin Clements <austin@google.com>
2021-08-11 17:08:54 +00:00
Cuong Manh Le 4d6f9d60cf [dev.typeparams] all: merge master (785a8f6) into dev.typeparams
- test/run.go

  CL 328050 added fixedbugs/issue46749.go to -G=3 excluded files list

Merge List:

+ 2021-06-16 785a8f677f cmd/compile: better error message for invalid untyped operation
+ 2021-06-16 a752bc0746 syscall: fix TestGroupCleanupUserNamespace test failure on Fedora
+ 2021-06-15 d77f4c0c5c net/http: improve some server docs
+ 2021-06-15 219fe9d547 cmd/go: ignore UTF8 BOM when reading source code
+ 2021-06-15 723f199edd cmd/link: set correct flags in .dynamic for PIE buildmode
+ 2021-06-15 4d2d89ff42 cmd/go, go/build: update docs to use //go:build syntax
+ 2021-06-15 033d885315 doc/go1.17: document go run pkg@version
+ 2021-06-15 ea8612ef42 syscall: disable c-shared test when no cgo, for windows/arm
+ 2021-06-15 abc56fd1a0 internal/bytealg: remove duplicate go:build line
+ 2021-06-15 4061d3463b syscall: rewrite handle inheritance test to use C rather than Powershell
+ 2021-06-15 cf4e3e3d3b reflect: explain why convertible or comparable types may still panic
+ 2021-06-14 7841cb14d9 doc/go1.17: assorted fixes
+ 2021-06-14 8a5a6f46dc debug/elf: don't apply DWARF relocations for ET_EXEC binaries
+ 2021-06-14 9d13f8d43e runtime: update the variable name in comment
+ 2021-06-14 0fd20ed5b6 reflect: use same conversion panic in reflect and runtime
+ 2021-06-14 6bbb0a9d4a cmd/internal/sys: mark windows/arm64 as c-shared-capable
+ 2021-06-14 d4f34f8c63 doc/go1.17: reword "results" in stack trace printing

Change-Id: I60d1f67c4d48cd4093c350fc89bd60c454d23944
2021-06-16 14:59:13 +07:00
cuishuang abc56fd1a0 internal/bytealg: remove duplicate go:build line
Change-Id: I6b71bf468b9544820829f02e320673f5edd785fa
GitHub-Last-Rev: 8082ac5fba
GitHub-Pull-Request: golang/go#46683
Reviewed-on: https://go-review.googlesource.com/c/go/+/326730
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Trust: Tobias Klauser <tobias.klauser@gmail.com>
2021-06-15 10:13:08 +00:00
Cherry Mui 5a40fab19f [dev.typeparams] runtime, internal/bytealg: port performance-critical functions to register ABI on ARM64
This CL ports a few performance-critical assembly functions to use
register arguments directly. This is similar to CL 308931 and
CL 310184.

Change-Id: I6e30dfff17f76b8578ce8cfd51de21b66610fdb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/324400
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
2021-06-03 16:28:19 +00:00
Cherry Mui 5a008a92e8 [dev.typeparams] internal/bytealg: call memeqbody directly in memequal_varlen on ARM64
Currently, memequal_varlen opens up a frame and call memequal,
which then tail-calls memeqbody. This CL changes memequal_varlen
tail-calls memeqbody directly.

This makes it simpler to switch to the register ABI in the next
CL.

Change-Id: Ia1367c0abb7f4755fe736c404411793fb9e5c04f
Reviewed-on: https://go-review.googlesource.com/c/go/+/324399
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2021-06-03 14:29:14 +00:00
Tobias Klauser 2c76a6f7f8 all: add //go:build lines to assembly files
Don't add them to files in vendor and cmd/vendor though. These will be
pulled in by updating the respective dependencies.

For #41184

Change-Id: Icc57458c9b3033c347124323f33084c85b224c70
Reviewed-on: https://go-review.googlesource.com/c/go/+/319389
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2021-05-13 09:12:17 +00:00
Lynn Boger 54af9fd9e6 internal/bytealg: add power9 version of bytes index
This adds a power9 version of the bytes.Index function
for little endian.

Here is the improvement on power9 for some of the Index
benchmarks:

Index/10           -0.14%
Index/32           -3.19%
Index/4K          -12.66%
Index/4M          -13.34%
Index/64M         -13.17%
Count/10           -0.59%
Count/32           -2.88%
Count/4K          -12.63%
Count/4M          -13.35%
Count/64M         -13.17%
IndexHard1        -23.03%
IndexHard2        -13.01%
IndexHard3        -22.12%
IndexHard4         +0.16%
CountHard1        -23.02%
CountHard2        -13.01%
CountHard3        -22.12%
IndexPeriodic/IndexPeriodic2  -22.85%
IndexPeriodic/IndexPeriodic4  -23.15%

Change-Id: Id72353e2771eba2efbb1544d5f0be65f8a9f0433
Reviewed-on: https://go-review.googlesource.com/c/go/+/311380
Run-TryBot: Carlos Eduardo Seo <carlos.seo@linaro.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
2021-04-21 21:14:04 +00:00
Lynn Boger 8009a81f7a bytes: add asm implementation for index on ppc64x
This adds an asm implementation of index on ppc64le and
ppc64. It results in a significant improvement in some
of the benchmarks that use bytes.Index.

The implementation is based on a port of the s390x asm
implementation. Comments on the design are found
with the code.

The following improvements occurred on power8:

Index/10       70.7ns ± 0%    18.8ns ± 0%   -73.4
Index/32        165ns ± 0%      95ns ± 0%   -42.6
Index/4K       9.23µs ± 0%    4.91µs ± 0%   -46
Index/4M       9.52ms ± 0%    5.10ms ± 0%   -46.4
Index/64M       155ms ± 0%      85ms ± 0%   -45.1

Count/10       83.0ns ± 0%    32.1ns ± 0%   -61.3
Count/32        178ns ± 0%     109ns ± 0%   -38.8
Count/4K       9.24µs ± 0%    4.93µs ± 0%   -46
Count/4M       9.52ms ± 0%    5.10ms ± 0%   -46.4
Count/64M       155ms ± 0%      85ms ± 0%   -45.1

IndexHard1     2.36ms ± 0%    0.13ms ± 0%   -94.4
IndexHard2     2.36ms ± 0%    1.28ms ± 0%   -45.8
IndexHard3     2.36ms ± 0%    1.19ms ± 0%   -49.4
IndexHard4     2.36ms ± 0%    2.35ms ± 0%    -0.1

CountHard1     2.36ms ± 0%    0.13ms ± 0%   -94.4
CountHard2     2.36ms ± 0%    1.28ms ± 0%   -45.8
CountHard3     2.36ms ± 0%    1.19ms ± 0%   -49.4

IndexPeriodic/IndexPeriodic2  146µs ± 0%       8µs ± 0%   -94
IndexPeriodic/IndexPeriodic4  146µs ± 0%       8µs ± 0%   -94

Change-Id: I7dd2bb7e278726e27f51825ca8b2f8317d460e60
Reviewed-on: https://go-review.googlesource.com/c/go/+/309730
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
Trust: Carlos Eduardo Seo <carlos.seo@linaro.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
2021-04-15 17:32:04 +00:00
Austin Clements 8f4c5068e0 internal/bytealg: port more performance-critical functions to ABIInternal
CL 308931 ported several runtime assembly functions to ABIInternal so
that compiler-generated ABIInternal calls don't go through ABI
wrappers, but it missed the runtime assembly functions that are
actually defined in internal/bytealg.

This eliminates the cost of wrappers for the BleveQuery and
GopherLuaKNucleotide benchmarks, but there's still more to do for
Tile38.

                                      0-base                1-wrappers
                                     sec/op        sec/op            vs base
BleveQuery                          6.507 ± 0%    6.477 ± 0%  -0.46% (p=0.004 n=20)
GopherLuaKNucleotide                30.39 ± 1%    30.34 ± 0%       ~ (p=0.301 n=20)
Tile38IntersectsCircle100kmRequest 1.038m ± 1%   1.080m ± 2%  +4.03% (p=0.000 n=20)

For #40724.

Change-Id: I0b722443f684fcb997b1d70802c5ed4b8d8f9829
Reviewed-on: https://go-review.googlesource.com/c/go/+/310184
Trust: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2021-04-15 04:10:33 +00:00
Russ Cox d4b2638234 all: go fmt std cmd (but revert vendor)
Make all our package sources use Go 1.17 gofmt format
(adding //go:build lines).

Part of //go:build change (#41184).
See https://golang.org/design/draft-gobuild

Change-Id: Ia0534360e4957e58cd9a18429c39d0e32a6addb4
Reviewed-on: https://go-review.googlesource.com/c/go/+/294430
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2021-02-20 03:54:50 +00:00
Meng Zhuo 4f597abe77 internal/bytealg: improve mips64x equal on large size
name               old time/op    new time/op    delta
Equal/0              9.94ns ± 4%    9.12ns ± 5%     -8.26%  (p=0.000 n=10+10)
Equal/1              24.5ns ± 0%    27.2ns ± 1%    +11.22%  (p=0.000 n=9+10)
Equal/6              28.1ns ± 0%    32.1ns ± 1%    +14.20%  (p=0.000 n=8+10)
Equal/9              37.1ns ± 0%    37.8ns ± 1%     +1.95%  (p=0.000 n=8+9)
Equal/15             47.3ns ± 0%    44.3ns ± 0%     -6.34%  (p=0.000 n=9+10)
Equal/16             42.9ns ± 0%    24.6ns ± 0%    -42.66%  (p=0.000 n=10+7)
Equal/20             44.3ns ± 0%    57.4ns ± 0%    +29.57%  (p=0.000 n=9+10)
Equal/32             63.2ns ± 0%    35.8ns ± 0%    -43.35%  (p=0.000 n=10+10)
Equal/4K             6.49µs ± 0%    0.50µs ± 0%    -92.27%  (p=0.000 n=10+8)
Equal/4M             6.70ms ± 0%    0.48ms ± 0%    -92.78%  (p=0.000 n=8+10)
Equal/64M             110ms ± 0%       8ms ± 0%    -92.65%  (p=0.000 n=9+9)
CompareBytesEqual    36.6ns ± 0%    35.9ns ± 0%     -1.83%  (p=0.000 n=10+9)

name               old speed      new speed      delta
Equal/1            40.8MB/s ± 0%  36.7MB/s ± 0%    -10.16%  (p=0.000 n=10+10)
Equal/6             213MB/s ± 0%   187MB/s ± 1%    -12.32%  (p=0.000 n=10+10)
Equal/9             243MB/s ± 0%   238MB/s ± 1%     -1.94%  (p=0.000 n=9+10)
Equal/15            317MB/s ± 0%   339MB/s ± 0%     +6.86%  (p=0.000 n=9+9)
Equal/16            373MB/s ± 0%   651MB/s ± 0%    +74.70%  (p=0.000 n=8+10)
Equal/20            452MB/s ± 0%   348MB/s ± 0%    -22.90%  (p=0.000 n=8+10)
Equal/32            506MB/s ± 0%   893MB/s ± 0%    +76.53%  (p=0.000 n=10+9)
Equal/4K            631MB/s ± 0%  8166MB/s ± 0%  +1194.73%  (p=0.000 n=10+10)
Equal/4M            626MB/s ± 0%  8673MB/s ± 0%  +1284.94%  (p=0.000 n=8+10)
Equal/64M           608MB/s ± 0%  8277MB/s ± 0%  +1260.83%  (p=0.000 n=9+9)

Change-Id: I1cd14ade16390a5097a8d4e9721d5e822fa6218f
Reviewed-on: https://go-review.googlesource.com/c/go/+/199597
Run-TryBot: Meng Zhuo <mzh@golangcn.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Trust: Meng Zhuo <mzh@golangcn.org>
2020-10-23 15:53:01 +00:00
Tobias Klauser 19f6422e00 internal/bytealg: add assembly implementation of Count/CountString for riscv64
Simple single-byte loop count for now, to be further improved in future
CLs.

Benchmark on linux/riscv64 (HiFive Unleashed):

name               old time/op    new time/op     delta
CountSingle/10-4      190ns ± 1%      145ns ± 1%  -23.66%  (p=0.000 n=10+9)
CountSingle/32-4      422ns ± 1%      268ns ± 0%  -36.43%  (p=0.000 n=10+7)
CountSingle/4K-4     43.3µs ± 0%     23.8µs ± 0%  -45.09%  (p=0.000 n=8+10)
CountSingle/4M-4     54.2ms ± 1%     33.3ms ± 1%  -38.48%  (p=0.000 n=10+10)
CountSingle/64M-4     1.52s ± 1%      1.20s ± 1%  -21.20%  (p=0.000 n=9+9)

name               old speed      new speed       delta
CountSingle/10-4   52.7MB/s ± 1%   69.1MB/s ± 1%  +31.03%  (p=0.000 n=10+9)
CountSingle/32-4   75.9MB/s ± 1%  119.5MB/s ± 0%  +57.34%  (p=0.000 n=10+8)
CountSingle/4K-4   94.6MB/s ± 0%  172.2MB/s ± 0%  +82.10%  (p=0.000 n=8+10)
CountSingle/4M-4   77.4MB/s ± 1%  125.8MB/s ± 1%  +62.54%  (p=0.000 n=10+10)
CountSingle/64M-4  44.2MB/s ± 1%   56.1MB/s ± 1%  +26.91%  (p=0.000 n=9+9)

Change-Id: I2a6bd50d22d5f598517bb3c5a50066c54280cac5
Reviewed-on: https://go-review.googlesource.com/c/go/+/263541
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
2020-10-19 14:55:41 +00:00
Tobias Klauser e69f6e8393 internal/bytealg: fix typo in IndexRabinKarp{,Bytes} godoc
Change-Id: I09ba19e19b195e345a0fe29d542e0d86529b0d31
Reviewed-on: https://go-review.googlesource.com/c/go/+/261359
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2020-10-13 06:48:02 +00:00
Michael Munday 11cdbab9d4 bytes, internal/bytealg: fix incorrect IndexString usage
The IndexString implementation in the bytealg package requires that
the string passed into it be in the range '2 <= len(s) <= MaxLen'
where MaxLen may be any value (including 0).

CL 156998 added calls to bytealg.IndexString where MaxLen was not
first checked. This led to an illegal instruction on s390x with
the vector facility disabled.

This CL guards the calls to bytealg.IndexString with a MaxLen check.
If the check fails then the code now falls back to the pre CL 156998
implementation (a loop over the runes in the string).

Since the MaxLen check is now in place the generic implementation is
no longer called so I have returned it to its original unimplemented
state.

In future we may want to drop MaxLen to prevent this kind of
confusion.

Fixes #41552.

Change-Id: Ibeb3f08720444a05c08d719ed97f6cef2423bbe9
Reviewed-on: https://go-review.googlesource.com/c/go/+/256717
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Munday <mike.munday@ibm.com>
Reviewed-by: Keith Randall <khr@golang.org>
2020-09-23 19:55:33 +00:00
Heisenberg 99d6e3eec2 internal/bytealg: use CBZ instructions
Use CBZ to replace the comparison and jump to the zero instruction in the arm64 assembly file.

Change-Id: Ie16fb52e27b4d327343e119ebc0f0ca756437bc4
Reviewed-on: https://go-review.googlesource.com/c/go/+/237477
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2020-08-17 21:02:28 +00:00
Keith Randall c6a11f0dd2 crypto,internal/bytealg: fix assembly that clobbers BP
BP should be callee-save. It will be saved automatically if
there is a nonzero frame size. Otherwise, we need to avoid this register.

Change-Id: If3f551efa42d830c8793d9f0183cb8daad7a2ab5
Reviewed-on: https://go-review.googlesource.com/c/go/+/248260
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2020-08-16 17:05:18 +00:00
erifan01 3af92acb9f strings, bytes: improve IndexAny and LastIndexAny performance
For the case of a pattern containing multi-byte rune, the time complexity of the
previous algorithm is O(nm), and if both input arguments are long, the search
performance will be poor. This CL improves the searching performance for these
cases by using IndexRune, which is mainly implemented with IndexByte and Index.
As IndexByte and Index are specially optimized with some powerful instructions
for short patterns (an UTF8 rune is 1 to 4 bytes), so they can help to reduce the
runtime complexity of IndexAny and LastIndexAny.

Another optimization method is using hash table, however, the actual test results
show that using indexrune is better, and the space complexity is lower.

There are two fast paths in IndexAny and LastIndexAny for cases where the length
of the input arguements are 1, and their locations are not exactly the same, which
is determined based on the actual test results.

Benchmarks on arm64 and amd64:

name                        old time/op  new time/op  delta
pkg:strings goos:linux goarch:arm64
IndexAnyASCII/1:1-8         23.7ns ± 3%  28.5ns ± 0%  +20.15%  (p=0.008 n=5+5)
IndexAnyASCII/1:2-8         18.0ns ± 0%  33.1ns ± 0%  +83.67%  (p=0.008 n=5+5)
IndexAnyASCII/1:4-8         20.0ns ± 0%  36.0ns ± 0%  +80.00%  (p=0.029 n=4+4)
IndexAnyASCII/1:8-8         36.1ns ± 0%  36.0ns ± 0%     ~     (p=0.095 n=5+4)
IndexAnyASCII/1:16-8        48.1ns ± 0%  36.0ns ± 0%  -25.19%  (p=0.029 n=4+4)
IndexAnyASCII/1:32-8        72.1ns ± 0%  36.0ns ± 0%  -50.01%  (p=0.008 n=5+5)
IndexAnyASCII/1:64-8         120ns ± 0%    39ns ± 0%  -67.83%  (p=0.008 n=5+5)
IndexAnyASCII/16:1-8        73.0ns ± 0%  28.5ns ± 0%  -60.95%  (p=0.008 n=5+5)
IndexAnyASCII/16:2-8        76.8ns ± 0%  77.0ns ± 0%     ~     (p=1.000 n=5+5)
IndexAnyASCII/16:4-8        83.2ns ± 1%  83.0ns ± 0%     ~     (p=0.770 n=5+5)
IndexAnyASCII/16:8-8         111ns ± 1%   107ns ± 0%   -3.25%  (p=0.008 n=5+5)
IndexAnyASCII/16:16-8        139ns ± 1%   137ns ± 0%   -1.58%  (p=0.008 n=5+5)
IndexAnyASCII/16:32-8        199ns ± 1%   197ns ± 0%   -1.20%  (p=0.008 n=5+5)
IndexAnyASCII/16:64-8        307ns ± 0%   313ns ± 0%   +1.82%  (p=0.016 n=5+4)
IndexAnyASCII/256:1-8        674ns ± 0%    65ns ± 0%  -90.31%  (p=0.008 n=5+5)
IndexAnyASCII/256:2-8        678ns ± 0%   683ns ± 0%   +0.68%  (p=0.008 n=5+5)
IndexAnyASCII/256:4-8        685ns ± 0%   683ns ± 0%   -0.29%  (p=0.000 n=5+4)
IndexAnyASCII/256:8-8        711ns ± 0%   708ns ± 0%   -0.48%  (p=0.008 n=5+5)
IndexAnyASCII/256:16-8       740ns ± 0%   740ns ± 0%     ~     (p=0.444 n=5+5)
IndexAnyASCII/256:32-8       799ns ± 0%   798ns ± 0%   -0.18%  (p=0.008 n=5+5)
IndexAnyASCII/256:64-8       910ns ± 0%   914ns ± 0%   +0.44%  (p=0.016 n=4+5)
IndexAnyUTF8/1:1-8          27.1ns ± 0%  19.0ns ± 0%  -29.79%  (p=0.008 n=5+5)
IndexAnyUTF8/1:2-8          44.1ns ± 0%  33.0ns ± 0%  -25.17%  (p=0.008 n=5+5)
IndexAnyUTF8/1:4-8          46.1ns ± 0%  33.1ns ± 0%  -28.29%  (p=0.016 n=4+5)
IndexAnyUTF8/1:8-8          85.1ns ± 0%  33.0ns ± 0%  -61.18%  (p=0.008 n=5+5)
IndexAnyUTF8/1:16-8          110ns ± 1%    36ns ± 0%  -67.27%  (p=0.008 n=5+5)
IndexAnyUTF8/1:32-8          188ns ± 0%    36ns ± 0%  -80.85%  (p=0.008 n=5+5)
IndexAnyUTF8/1:64-8          332ns ± 0%    39ns ± 0%     ~     (p=0.079 n=4+5)
IndexAnyUTF8/16:1-8          293ns ± 0%    54ns ± 0%  -81.56%  (p=0.008 n=5+5)
IndexAnyUTF8/16:2-8          563ns ± 0%   349ns ± 0%  -37.98%  (p=0.008 n=5+5)
IndexAnyUTF8/16:4-8          546ns ± 1%   349ns ± 0%  -36.10%  (p=0.000 n=5+4)
IndexAnyUTF8/16:8-8         1.22µs ± 0%  0.35µs ± 0%  -71.39%  (p=0.008 n=5+5)
IndexAnyUTF8/16:16-8        1.63µs ± 1%  0.42µs ± 0%  -73.98%  (p=0.008 n=5+5)
IndexAnyUTF8/16:32-8        2.87µs ± 0%  0.42µs ± 0%  -85.22%  (p=0.008 n=5+5)
IndexAnyUTF8/16:64-8        5.18µs ± 0%  0.47µs ± 0%  -90.98%  (p=0.008 n=5+5)
IndexAnyUTF8/256:1-8        4.26µs ± 0%  0.47µs ± 0%  -88.85%  (p=0.000 n=4+5)
IndexAnyUTF8/256:2-8        8.62µs ± 0%  5.15µs ± 0%  -40.21%  (p=0.008 n=5+5)
IndexAnyUTF8/256:4-8        8.25µs ± 0%  5.15µs ± 0%  -37.50%  (p=0.016 n=5+4)
IndexAnyUTF8/256:8-8        19.2µs ± 1%   5.2µs ± 0%  -73.08%  (p=0.016 n=5+4)
IndexAnyUTF8/256:16-8       25.6µs ± 1%   6.3µs ± 0%  -75.32%  (p=0.008 n=5+5)
IndexAnyUTF8/256:32-8       45.6µs ± 0%   6.3µs ± 0%  -86.15%  (p=0.008 n=5+5)
IndexAnyUTF8/256:64-8       82.4µs ± 0%   7.0µs ± 0%  -91.53%  (p=0.016 n=5+4)
LastIndexAnyASCII/1:1-8     23.0ns ± 0%  33.5ns ± 0%  +45.65%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:2-8     24.5ns ± 0%  33.5ns ± 0%  +36.73%  (p=0.016 n=4+5)
LastIndexAnyASCII/1:4-8     27.5ns ± 0%  35.5ns ± 0%  +29.09%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:8-8     44.5ns ± 0%  35.5ns ± 0%  -20.13%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:16-8    56.5ns ± 0%  35.5ns ± 0%  -37.15%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:32-8    80.3ns ± 0%  35.5ns ± 0%  -55.79%  (p=0.000 n=5+4)
LastIndexAnyASCII/1:64-8     129ns ± 0%    40ns ± 0%  -68.85%  (p=0.008 n=5+5)
LastIndexAnyASCII/16:1-8    72.8ns ± 0%  72.7ns ± 0%   -0.19%  (p=0.016 n=4+5)
LastIndexAnyASCII/16:2-8    75.4ns ± 0%  75.1ns ± 0%     ~     (p=0.127 n=5+5)
LastIndexAnyASCII/16:4-8    81.9ns ± 1%  80.2ns ± 0%   -2.00%  (p=0.008 n=5+5)
LastIndexAnyASCII/16:8-8     110ns ± 1%   108ns ± 0%   -1.46%  (p=0.008 n=5+5)
LastIndexAnyASCII/16:16-8    138ns ± 1%   134ns ± 0%   -3.18%  (p=0.008 n=5+5)
LastIndexAnyASCII/16:32-8    198ns ± 0%   197ns ± 0%   -0.51%  (p=0.008 n=5+5)
LastIndexAnyASCII/16:64-8    309ns ± 0%   313ns ± 0%   +1.30%  (p=0.008 n=5+5)
LastIndexAnyASCII/256:1-8    652ns ± 0%   653ns ± 0%   +0.21%  (p=0.008 n=5+5)
LastIndexAnyASCII/256:2-8    656ns ± 0%   656ns ± 0%     ~     (all equal)
LastIndexAnyASCII/256:4-8    663ns ± 0%   663ns ± 0%     ~     (p=0.444 n=5+5)
LastIndexAnyASCII/256:8-8    691ns ± 0%   690ns ± 0%     ~     (p=0.079 n=4+5)
LastIndexAnyASCII/256:16-8   719ns ± 0%   715ns ± 0%   -0.53%  (p=0.000 n=5+4)
LastIndexAnyASCII/256:32-8   779ns ± 0%   780ns ± 0%   +0.13%  (p=0.029 n=4+4)
LastIndexAnyASCII/256:64-8   890ns ± 0%   894ns ± 0%   +0.45%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:1-8      31.6ns ± 0%  33.5ns ± 0%   +6.01%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:2-8      48.6ns ± 0%  33.5ns ± 0%  -30.99%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:4-8      48.6ns ± 0%  33.5ns ± 0%  -31.13%  (p=0.000 n=5+4)
LastIndexAnyUTF8/1:8-8      89.6ns ± 0%  33.5ns ± 0%  -62.56%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:16-8      113ns ± 1%    36ns ± 0%  -68.47%  (p=0.000 n=5+4)
LastIndexAnyUTF8/1:32-8      190ns ± 0%    36ns ± 0%  -81.26%  (p=0.029 n=4+4)
LastIndexAnyUTF8/1:64-8      327ns ± 0%    40ns ± 0%  -87.77%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:1-8      364ns ± 0%   158ns ± 0%     ~     (p=0.079 n=4+5)
LastIndexAnyUTF8/16:2-8      636ns ± 0%   472ns ± 0%  -25.79%  (p=0.000 n=5+4)
LastIndexAnyUTF8/16:4-8      630ns ± 0%   472ns ± 0%  -25.03%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:8-8     1.28µs ± 0%  0.47µs ± 0%  -63.09%  (p=0.016 n=5+4)
LastIndexAnyUTF8/16:16-8    1.66µs ± 0%  0.53µs ± 0%  -68.39%  (p=0.016 n=5+4)
LastIndexAnyUTF8/16:32-8    2.88µs ± 0%  0.53µs ± 0%  -81.72%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:64-8    5.08µs ± 0%  0.57µs ± 0%  -88.79%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:1-8    5.41µs ± 0%  2.03µs ± 0%  -62.46%  (p=0.016 n=4+5)
LastIndexAnyUTF8/256:2-8    9.77µs ± 0%  7.14µs ± 0%  -26.97%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:4-8    9.63µs ± 0%  7.14µs ± 0%  -25.86%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:8-8    20.0µs ± 0%   7.1µs ± 0%  -64.30%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:16-8   26.1µs ± 1%   8.0µs ± 0%  -69.40%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:32-8   45.6µs ± 1%   8.0µs ± 0%  -82.51%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:64-8   80.8µs ± 0%   8.6µs ± 0%  -89.33%  (p=0.016 n=5+4)
pkg:bytes goos:linux goarch:arm64
IndexAnyASCII/1:1-8         26.2ns ± 1%  26.5ns ± 0%   +1.30%  (p=0.016 n=5+4)
IndexAnyASCII/1:2-8         18.5ns ± 0%  26.5ns ± 0%  +43.24%  (p=0.008 n=5+5)
IndexAnyASCII/1:4-8         21.0ns ± 0%  26.5ns ± 0%  +26.38%  (p=0.008 n=5+5)
IndexAnyASCII/1:8-8         37.5ns ± 0%  26.5ns ± 0%  -29.33%  (p=0.000 n=5+4)
IndexAnyASCII/1:16-8        49.6ns ± 0%  26.5ns ± 0%  -46.49%  (p=0.008 n=5+5)
IndexAnyASCII/1:32-8        73.6ns ± 0%  30.1ns ± 0%  -59.16%  (p=0.008 n=5+5)
IndexAnyASCII/1:64-8         122ns ± 0%    33ns ± 0%  -73.23%  (p=0.008 n=5+5)
IndexAnyASCII/16:1-8        73.7ns ± 0%  33.4ns ± 0%  -54.71%  (p=0.008 n=5+5)
IndexAnyASCII/16:2-8        79.1ns ± 0%  78.9ns ± 0%   -0.30%  (p=0.016 n=4+5)
IndexAnyASCII/16:4-8        84.8ns ± 0%  86.1ns ± 0%   +1.58%  (p=0.016 n=5+4)
IndexAnyASCII/16:8-8         111ns ± 0%   111ns ± 0%     ~     (all equal)
IndexAnyASCII/16:16-8        139ns ± 0%   144ns ± 0%   +3.60%  (p=0.016 n=4+5)
IndexAnyASCII/16:32-8        196ns ± 0%   207ns ± 0%   +5.61%  (p=0.016 n=5+4)
IndexAnyASCII/16:64-8        311ns ± 0%   320ns ± 0%   +2.89%  (p=0.016 n=4+5)
IndexAnyASCII/256:1-8        674ns ± 0%    65ns ± 1%  -90.35%  (p=0.008 n=5+5)
IndexAnyASCII/256:2-8        680ns ± 0%   680ns ± 0%     ~     (p=0.444 n=5+5)
IndexAnyASCII/256:4-8        686ns ± 0%   687ns ± 0%     ~     (p=0.167 n=5+5)
IndexAnyASCII/256:8-8        713ns ± 0%   712ns ± 0%   -0.14%  (p=0.008 n=5+5)
IndexAnyASCII/256:16-8       740ns ± 0%   744ns ± 0%   +0.54%  (p=0.016 n=5+4)
IndexAnyASCII/256:32-8       797ns ± 0%   808ns ± 0%   +1.43%  (p=0.008 n=5+5)
IndexAnyASCII/256:64-8       912ns ± 0%   921ns ± 0%   +0.99%  (p=0.016 n=4+5)
IndexAnyUTF8/1:1-8          27.5ns ± 0%  26.5ns ± 0%   -3.64%  (p=0.008 n=5+5)
IndexAnyUTF8/1:2-8          44.5ns ± 0%  26.5ns ± 0%  -40.50%  (p=0.008 n=5+5)
IndexAnyUTF8/1:4-8          45.6ns ± 0%  26.5ns ± 0%  -41.89%  (p=0.000 n=5+4)
IndexAnyUTF8/1:8-8          85.8ns ± 1%  26.5ns ± 0%  -69.11%  (p=0.008 n=5+5)
IndexAnyUTF8/1:16-8          110ns ± 1%    26ns ± 0%  -76.00%  (p=0.016 n=5+4)
IndexAnyUTF8/1:32-8          188ns ± 0%    30ns ± 0%  -84.04%  (p=0.008 n=5+5)
IndexAnyUTF8/1:64-8          333ns ± 0%    33ns ± 0%  -90.20%  (p=0.008 n=5+5)
IndexAnyUTF8/16:1-8          294ns ± 0%   235ns ± 0%  -20.07%  (p=0.008 n=5+5)
IndexAnyUTF8/16:2-8          563ns ± 0%   309ns ± 0%  -45.12%  (p=0.008 n=5+5)
IndexAnyUTF8/16:4-8          558ns ± 1%   309ns ± 0%  -44.60%  (p=0.000 n=5+4)
IndexAnyUTF8/16:8-8         1.23µs ± 0%  0.31µs ± 0%  -74.79%  (p=0.008 n=5+5)
IndexAnyUTF8/16:16-8        1.62µs ± 2%  0.31µs ± 0%  -80.93%  (p=0.008 n=5+5)
IndexAnyUTF8/16:32-8        2.86µs ± 0%  0.38µs ± 0%  -86.87%  (p=0.008 n=5+5)
IndexAnyUTF8/16:64-8        5.18µs ± 0%  0.42µs ± 0%  -91.86%  (p=0.008 n=5+5)
IndexAnyUTF8/256:1-8        4.27µs ± 1%  3.30µs ± 1%  -22.75%  (p=0.008 n=5+5)
IndexAnyUTF8/256:2-8        8.61µs ± 0%  4.45µs ± 0%  -48.31%  (p=0.016 n=4+5)
IndexAnyUTF8/256:4-8        8.44µs ± 0%  4.45µs ± 0%  -47.23%  (p=0.008 n=5+5)
IndexAnyUTF8/256:8-8        19.2µs ± 0%   4.5µs ± 0%  -76.78%  (p=0.008 n=5+5)
IndexAnyUTF8/256:16-8       25.6µs ± 0%   4.5µs ± 0%  -82.63%  (p=0.008 n=5+5)
IndexAnyUTF8/256:32-8       45.4µs ± 0%   5.5µs ± 0%  -87.85%  (p=0.016 n=4+5)
IndexAnyUTF8/256:64-8       82.5µs ± 0%   6.2µs ± 0%  -92.49%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:1-8     23.0ns ± 0%  26.5ns ± 0%  +15.02%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:2-8     24.5ns ± 0%  26.5ns ± 0%   +8.16%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:4-8     27.8ns ± 0%  26.5ns ± 0%   -4.68%  (p=0.029 n=4+4)
LastIndexAnyASCII/1:8-8     45.1ns ± 1%  26.5ns ± 0%  -41.29%  (p=0.000 n=5+4)
LastIndexAnyASCII/1:16-8    57.1ns ± 0%  26.5ns ± 0%  -53.61%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:32-8    81.5ns ± 0%  30.0ns ± 0%     ~     (p=0.079 n=4+5)
LastIndexAnyASCII/1:64-8     129ns ± 0%    32ns ± 0%  -74.81%  (p=0.008 n=5+5)
LastIndexAnyASCII/16:1-8    72.6ns ± 0%  72.1ns ± 0%   -0.63%  (p=0.000 n=4+5)
LastIndexAnyASCII/16:2-8    77.2ns ± 0%  77.2ns ± 0%     ~     (p=0.167 n=5+5)
LastIndexAnyASCII/16:4-8    83.1ns ± 0%  83.2ns ± 0%     ~     (p=0.444 n=5+5)
LastIndexAnyASCII/16:8-8     109ns ± 1%   108ns ± 0%     ~     (p=0.167 n=5+5)
LastIndexAnyASCII/16:16-8    136ns ± 0%   136ns ± 0%     ~     (all equal)
LastIndexAnyASCII/16:32-8    195ns ± 0%   197ns ± 0%   +0.82%  (p=0.008 n=5+5)
LastIndexAnyASCII/16:64-8    309ns ± 0%   309ns ± 0%     ~     (all equal)
LastIndexAnyASCII/256:1-8    653ns ± 0%   657ns ± 0%   +0.61%  (p=0.008 n=5+5)
LastIndexAnyASCII/256:2-8    659ns ± 0%   658ns ± 0%     ~     (p=0.167 n=5+5)
LastIndexAnyASCII/256:4-8    664ns ± 0%   663ns ± 0%     ~     (p=0.095 n=5+4)
LastIndexAnyASCII/256:8-8    698ns ± 0%   689ns ± 0%   -1.29%  (p=0.008 n=5+5)
LastIndexAnyASCII/256:16-8   726ns ± 0%   717ns ± 0%   -1.24%  (p=0.008 n=5+5)
LastIndexAnyASCII/256:32-8   777ns ± 0%   779ns ± 0%     ~     (p=0.079 n=5+4)
LastIndexAnyASCII/256:64-8   889ns ± 0%   890ns ± 0%     ~     (p=0.444 n=5+5)
LastIndexAnyUTF8/1:1-8      32.1ns ± 0%  26.5ns ± 0%  -17.45%  (p=0.000 n=5+4)
LastIndexAnyUTF8/1:2-8      48.6ns ± 0%  26.5ns ± 0%  -45.52%  (p=0.000 n=5+4)
LastIndexAnyUTF8/1:4-8      49.6ns ± 0%  26.5ns ± 0%  -46.62%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:8-8      91.9ns ± 0%  26.5ns ± 0%  -71.18%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:16-8      114ns ± 1%    26ns ± 0%  -76.84%  (p=0.000 n=5+4)
LastIndexAnyUTF8/1:32-8      203ns ± 6%    30ns ± 0%  -85.25%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:64-8      330ns ± 0%    33ns ± 0%  -90.14%  (p=0.000 n=4+5)
LastIndexAnyUTF8/16:1-8      365ns ± 0%   164ns ± 0%  -55.04%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:2-8      638ns ± 0%   296ns ± 0%  -53.58%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:4-8      634ns ± 0%   296ns ± 0%  -53.31%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:8-8     1.30µs ± 0%  0.30µs ± 0%  -77.18%  (p=0.000 n=4+5)
LastIndexAnyUTF8/16:16-8    1.66µs ± 0%  0.30µs ± 0%  -82.19%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:32-8    2.90µs ± 0%  0.38µs ± 0%  -87.00%  (p=0.029 n=4+4)
LastIndexAnyUTF8/16:64-8    5.10µs ± 0%  0.42µs ± 0%  -91.78%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:1-8    5.42µs ± 0%  2.12µs ± 0%  -60.92%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:2-8    9.79µs ± 0%  4.26µs ± 0%  -56.47%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:4-8    9.66µs ± 0%  4.26µs ± 0%  -55.87%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:8-8    20.4µs ± 0%   4.3µs ± 0%  -79.10%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:16-8   26.0µs ± 1%   4.3µs ± 0%  -83.62%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:32-8   46.0µs ± 0%   5.5µs ± 0%  -88.09%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:64-8   81.1µs ± 0%   6.2µs ± 0%  -92.38%  (p=0.008 n=5+5)

name                         old time/op  new time/op   delta
pkg:strings goos:linux goarch:amd64
IndexAnyASCII/1:1-48         10.0ns ± 0%   13.3ns ± 0%  +33.00%  (p=0.008 n=5+5)
IndexAnyASCII/1:2-48         11.0ns ± 0%   15.5ns ± 0%  +40.55%  (p=0.016 n=4+5)
IndexAnyASCII/1:4-48         12.9ns ± 0%   15.4ns ± 0%  +19.69%  (p=0.008 n=5+5)
IndexAnyASCII/1:8-48         18.6ns ± 0%   15.5ns ± 0%  -16.45%  (p=0.000 n=4+5)
IndexAnyASCII/1:16-48        30.1ns ± 0%   16.9ns ± 0%     ~     (p=0.079 n=4+5)
IndexAnyASCII/1:32-48        53.1ns ± 0%   18.6ns ± 0%  -64.95%  (p=0.000 n=5+4)
IndexAnyASCII/1:64-48        98.9ns ± 0%   17.4ns ± 0%  -82.41%  (p=0.000 n=5+4)
IndexAnyASCII/16:1-48        35.0ns ± 0%   14.2ns ± 0%  -59.47%  (p=0.000 n=5+4)
IndexAnyASCII/16:2-48        35.5ns ± 0%   35.6ns ± 0%     ~     (p=0.238 n=5+4)
IndexAnyASCII/16:4-48        40.8ns ± 0%   40.7ns ± 1%     ~     (p=0.643 n=5+5)
IndexAnyASCII/16:8-48        50.8ns ± 0%   50.9ns ± 1%     ~     (p=1.000 n=4+5)
IndexAnyASCII/16:16-48       64.0ns ± 1%   64.5ns ± 1%     ~     (p=0.071 n=5+5)
IndexAnyASCII/16:32-48       98.3ns ± 0%  100.8ns ± 1%   +2.52%  (p=0.008 n=5+5)
IndexAnyASCII/16:64-48        156ns ± 0%    157ns ± 0%     ~     (p=0.238 n=4+5)
IndexAnyASCII/256:1-48        299ns ± 0%     24ns ± 3%  -92.12%  (p=0.008 n=5+5)
IndexAnyASCII/256:2-48        303ns ± 0%    304ns ± 0%     ~     (p=0.762 n=5+5)
IndexAnyASCII/256:4-48        311ns ± 0%    311ns ± 0%     ~     (p=0.476 n=5+5)
IndexAnyASCII/256:8-48        321ns ± 0%    321ns ± 0%     ~     (p=0.429 n=4+5)
IndexAnyASCII/256:16-48       334ns ± 0%    335ns ± 0%     ~     (p=0.079 n=5+4)
IndexAnyASCII/256:32-48       367ns ± 0%    365ns ± 0%     ~     (p=0.079 n=4+5)
IndexAnyASCII/256:64-48       431ns ± 1%    421ns ± 0%   -2.27%  (p=0.008 n=5+5)
IndexAnyUTF8/1:1-48          17.2ns ± 0%   10.8ns ± 0%  -37.21%  (p=0.029 n=4+4)
IndexAnyUTF8/1:2-48          26.7ns ± 0%   15.6ns ± 0%     ~     (p=0.079 n=4+5)
IndexAnyUTF8/1:4-48          28.2ns ± 0%   15.6ns ± 0%  -44.68%  (p=0.000 n=5+4)
IndexAnyUTF8/1:8-48          48.8ns ± 0%   15.6ns ± 0%  -68.03%  (p=0.029 n=4+4)
IndexAnyUTF8/1:16-48         58.3ns ± 0%   16.2ns ± 0%     ~     (p=0.079 n=4+5)
IndexAnyUTF8/1:32-48          103ns ± 0%     18ns ± 0%  -82.27%  (p=0.008 n=5+5)
IndexAnyUTF8/1:64-48          182ns ± 0%     17ns ± 0%  -90.53%  (p=0.008 n=5+5)
IndexAnyUTF8/16:1-48          197ns ± 0%     25ns ± 0%  -87.34%  (p=0.000 n=5+4)
IndexAnyUTF8/16:2-48          348ns ± 0%    163ns ± 0%  -53.11%  (p=0.000 n=5+4)
IndexAnyUTF8/16:4-48          374ns ± 0%    163ns ± 0%  -56.37%  (p=0.000 n=5+4)
IndexAnyUTF8/16:8-48          716ns ± 0%    163ns ± 0%  -77.22%  (p=0.000 n=5+4)
IndexAnyUTF8/16:16-48         859ns ± 0%    175ns ± 0%  -79.63%  (p=0.000 n=5+4)
IndexAnyUTF8/16:32-48        1.58µs ± 0%   0.20µs ± 0%  -87.01%  (p=0.029 n=4+4)
IndexAnyUTF8/16:64-48        2.84µs ± 0%   0.19µs ± 1%  -93.34%  (p=0.008 n=5+5)
IndexAnyUTF8/256:1-48        2.61µs ± 0%   0.27µs ± 0%  -89.81%  (p=0.008 n=5+5)
IndexAnyUTF8/256:2-48        4.95µs ± 0%   2.23µs ± 0%  -54.91%  (p=0.016 n=5+4)
IndexAnyUTF8/256:4-48        5.55µs ± 0%   2.23µs ± 0%  -59.72%  (p=0.008 n=5+5)
IndexAnyUTF8/256:8-48        10.8µs ± 0%    2.2µs ± 0%  -79.39%  (p=0.008 n=5+5)
IndexAnyUTF8/256:16-48       13.1µs ± 0%    2.5µs ± 0%  -81.21%  (p=0.016 n=4+5)
IndexAnyUTF8/256:32-48       24.7µs ± 0%    2.8µs ± 0%  -88.49%  (p=0.008 n=5+5)
IndexAnyUTF8/256:64-48       45.0µs ± 0%    2.6µs ± 1%  -94.23%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:1-48     13.9ns ± 0%   15.2ns ± 0%   +9.35%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:2-48     14.4ns ± 0%   15.2ns ± 0%   +5.56%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:4-48     16.7ns ± 0%   15.2ns ± 0%   -8.98%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:8-48     24.0ns ± 0%   15.2ns ± 0%  -36.67%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:16-48    35.6ns ± 0%   15.0ns ± 0%  -57.82%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:32-48    68.9ns ± 0%   16.7ns ± 0%  -75.75%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:64-48     104ns ± 0%     17ns ± 1%  -83.81%  (p=0.008 n=5+5)
LastIndexAnyASCII/16:1-48    35.0ns ± 0%   35.0ns ± 0%     ~     (all equal)
LastIndexAnyASCII/16:2-48    35.6ns ± 0%   35.6ns ± 0%     ~     (all equal)
LastIndexAnyASCII/16:4-48    41.0ns ± 0%   40.8ns ± 0%   -0.49%  (p=0.032 n=5+5)
LastIndexAnyASCII/16:8-48    50.9ns ± 0%   50.7ns ± 1%     ~     (p=0.397 n=5+5)
LastIndexAnyASCII/16:16-48   64.3ns ± 1%   64.4ns ± 1%     ~     (p=1.000 n=4+5)
LastIndexAnyASCII/16:32-48    100ns ± 0%    100ns ± 0%   +0.38%  (p=0.016 n=4+5)
LastIndexAnyASCII/16:64-48    157ns ± 1%    163ns ± 0%   +3.82%  (p=0.008 n=5+5)
LastIndexAnyASCII/256:1-48    302ns ± 0%    300ns ± 0%   -0.53%  (p=0.008 n=5+5)
LastIndexAnyASCII/256:2-48    305ns ± 0%    303ns ± 0%   -0.66%  (p=0.000 n=5+4)
LastIndexAnyASCII/256:4-48    313ns ± 0%    307ns ± 0%   -2.04%  (p=0.000 n=4+5)
LastIndexAnyASCII/256:8-48    323ns ± 0%    315ns ± 0%   -2.48%  (p=0.029 n=4+4)
LastIndexAnyASCII/256:16-48   333ns ± 0%    332ns ± 0%   -0.30%  (p=0.048 n=5+5)
LastIndexAnyASCII/256:32-48   366ns ± 0%    367ns ± 0%     ~     (p=0.238 n=4+5)
LastIndexAnyASCII/256:64-48   430ns ± 0%    430ns ± 0%     ~     (p=1.000 n=5+5)
LastIndexAnyUTF8/1:1-48      21.1ns ± 0%   13.9ns ± 0%  -34.00%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:2-48      29.5ns ± 0%   13.9ns ± 0%  -52.95%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:4-48      31.6ns ± 0%   13.9ns ± 0%  -55.96%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:8-48      51.1ns ± 0%   13.9ns ± 0%  -72.81%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:16-48     58.9ns ± 0%   14.6ns ± 0%  -75.23%  (p=0.016 n=5+4)
LastIndexAnyUTF8/1:32-48      103ns ± 0%     16ns ± 1%  -84.12%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:64-48      177ns ± 0%     17ns ± 1%  -90.62%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:1-48      275ns ± 1%    105ns ± 0%  -61.85%  (p=0.000 n=5+4)
LastIndexAnyUTF8/16:2-48      406ns ± 0%    216ns ± 0%  -46.70%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:4-48      458ns ± 0%    216ns ± 0%  -52.75%  (p=0.000 n=4+5)
LastIndexAnyUTF8/16:8-48      753ns ± 0%    216ns ± 0%  -71.31%  (p=0.029 n=4+4)
LastIndexAnyUTF8/16:16-48     902ns ± 0%    221ns ± 0%  -75.50%  (p=0.016 n=5+4)
LastIndexAnyUTF8/16:32-48    1.57µs ± 0%   0.24µs ± 0%  -84.46%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:64-48    2.77µs ± 0%   0.24µs ± 0%  -91.22%  (p=0.000 n=5+4)
LastIndexAnyUTF8/256:1-48    4.06µs ± 0%   1.53µs ± 0%  -62.26%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:2-48    5.92µs ± 0%   3.04µs ± 0%  -48.55%  (p=0.016 n=4+5)
LastIndexAnyUTF8/256:4-48    6.82µs ± 0%   3.04µs ± 0%  -55.34%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:8-48    11.5µs ± 0%    3.0µs ± 0%  -73.48%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:16-48   14.1µs ± 0%    3.1µs ± 0%  -77.85%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:32-48   24.5µs ± 0%    3.5µs ± 0%  -85.85%  (p=0.016 n=5+4)
LastIndexAnyUTF8/256:64-48   44.0µs ± 0%    3.5µs ± 0%  -92.12%  (p=0.008 n=5+5)
pkg:bytes goos:linux goarch:amd64
IndexAnyASCII/1:1-48         9.56ns ± 0%  11.00ns ± 0%  +15.06%  (p=0.016 n=5+4)
IndexAnyASCII/1:2-48         11.0ns ± 0%   10.8ns ± 2%   -1.64%  (p=0.048 n=5+5)
IndexAnyASCII/1:4-48         13.9ns ± 0%   11.0ns ± 1%  -21.15%  (p=0.008 n=5+5)
IndexAnyASCII/1:8-48         19.6ns ± 0%   10.8ns ± 3%  -44.90%  (p=0.008 n=5+5)
IndexAnyASCII/1:16-48        31.1ns ± 0%   11.5ns ± 0%  -63.02%  (p=0.008 n=5+5)
IndexAnyASCII/1:32-48        54.0ns ± 0%   11.8ns ± 0%  -78.15%  (p=0.000 n=5+4)
IndexAnyASCII/1:64-48         100ns ± 0%     13ns ± 0%  -86.89%  (p=0.008 n=5+5)
IndexAnyASCII/16:1-48        35.5ns ± 0%   14.8ns ± 0%  -58.26%  (p=0.008 n=5+5)
IndexAnyASCII/16:2-48        36.2ns ± 1%   36.0ns ± 1%     ~     (p=0.087 n=5+5)
IndexAnyASCII/16:4-48        40.3ns ± 1%   39.7ns ± 4%     ~     (p=0.175 n=4+5)
IndexAnyASCII/16:8-48        48.7ns ± 5%   45.8ns ± 0%   -6.02%  (p=0.016 n=5+4)
IndexAnyASCII/16:16-48       64.1ns ±11%   62.1ns ± 1%     ~     (p=0.143 n=5+5)
IndexAnyASCII/16:32-48       97.9ns ± 1%   98.3ns ± 1%     ~     (p=0.294 n=5+5)
IndexAnyASCII/16:64-48        163ns ± 0%    157ns ± 0%   -3.68%  (p=0.008 n=5+5)
IndexAnyASCII/256:1-48        389ns ± 0%     25ns ± 0%  -93.65%  (p=0.000 n=5+4)
IndexAnyASCII/256:2-48        391ns ± 0%    307ns ± 0%  -21.48%  (p=0.000 n=5+4)
IndexAnyASCII/256:4-48        394ns ± 0%    323ns ± 0%  -17.92%  (p=0.008 n=5+5)
IndexAnyASCII/256:8-48        402ns ± 0%    323ns ± 0%  -19.51%  (p=0.008 n=5+5)
IndexAnyASCII/256:16-48       414ns ± 0%    334ns ± 0%  -19.32%  (p=0.016 n=4+5)
IndexAnyASCII/256:32-48       446ns ± 0%    367ns ± 0%  -17.75%  (p=0.016 n=5+4)
IndexAnyASCII/256:64-48       511ns ± 0%    424ns ± 0%  -17.02%  (p=0.008 n=5+5)
IndexAnyUTF8/1:1-48          17.4ns ± 0%   11.0ns ± 0%  -36.64%  (p=0.008 n=5+5)
IndexAnyUTF8/1:2-48          27.3ns ± 1%   11.0ns ± 0%  -59.74%  (p=0.008 n=5+5)
IndexAnyUTF8/1:4-48          28.7ns ± 0%   11.0ns ± 0%  -61.73%  (p=0.008 n=5+5)
IndexAnyUTF8/1:8-48          49.2ns ± 0%   11.0ns ± 0%  -77.66%  (p=0.008 n=5+5)
IndexAnyUTF8/1:16-48         56.0ns ± 0%   11.5ns ± 0%  -79.46%  (p=0.000 n=5+4)
IndexAnyUTF8/1:32-48          102ns ± 0%     12ns ± 0%  -88.24%  (p=0.008 n=5+5)
IndexAnyUTF8/1:64-48          177ns ± 0%     13ns ± 0%  -92.51%  (p=0.008 n=5+5)
IndexAnyUTF8/16:1-48          212ns ± 0%    112ns ± 0%  -47.17%  (p=0.008 n=5+5)
IndexAnyUTF8/16:2-48          356ns ± 0%    159ns ± 1%  -55.28%  (p=0.000 n=4+5)
IndexAnyUTF8/16:4-48          372ns ± 0%    158ns ± 0%  -57.47%  (p=0.008 n=5+5)
IndexAnyUTF8/16:8-48          712ns ± 0%    159ns ± 1%  -77.70%  (p=0.008 n=5+5)
IndexAnyUTF8/16:16-48         829ns ± 0%    129ns ± 0%  -84.44%  (p=0.008 n=5+5)
IndexAnyUTF8/16:32-48        1.55µs ± 0%   0.16µs ± 0%  -89.87%  (p=0.008 n=5+5)
IndexAnyUTF8/16:64-48        2.77µs ± 0%   0.14µs ± 0%  -94.94%  (p=0.008 n=5+5)
IndexAnyUTF8/256:1-48        2.85µs ± 0%   1.63µs ± 1%  -42.74%  (p=0.008 n=5+5)
IndexAnyUTF8/256:2-48        5.14µs ± 1%   2.03µs ± 0%  -60.51%  (p=0.008 n=5+5)
IndexAnyUTF8/256:4-48        5.56µs ± 0%   2.03µs ± 0%  -63.52%  (p=0.008 n=5+5)
IndexAnyUTF8/256:8-48        10.8µs ± 0%    2.0µs ± 0%  -81.22%  (p=0.008 n=5+5)
IndexAnyUTF8/256:16-48       12.9µs ± 0%    1.9µs ± 0%  -85.55%  (p=0.008 n=5+5)
IndexAnyUTF8/256:32-48       24.2µs ± 0%    2.1µs ± 0%  -91.29%  (p=0.016 n=5+4)
IndexAnyUTF8/256:64-48       43.7µs ± 0%    2.0µs ± 0%  -95.32%  (p=0.016 n=5+4)
LastIndexAnyASCII/1:1-48     13.7ns ± 1%   12.8ns ± 0%   -6.57%  (p=0.016 n=5+4)
LastIndexAnyASCII/1:2-48     14.7ns ± 0%   12.7ns ± 1%  -13.33%  (p=0.000 n=4+5)
LastIndexAnyASCII/1:4-48     16.9ns ± 0%   12.7ns ± 1%  -24.73%  (p=0.000 n=4+5)
LastIndexAnyASCII/1:8-48     20.5ns ± 0%   12.7ns ± 0%  -37.85%  (p=0.000 n=4+5)
LastIndexAnyASCII/1:16-48    28.0ns ± 0%   11.7ns ± 0%     ~     (p=0.079 n=4+5)
LastIndexAnyASCII/1:32-48    69.8ns ± 0%   12.4ns ± 0%  -82.19%  (p=0.008 n=5+5)
LastIndexAnyASCII/1:64-48    73.8ns ± 0%   13.3ns ± 0%  -82.03%  (p=0.000 n=4+5)
LastIndexAnyASCII/16:1-48    35.5ns ± 0%   35.5ns ± 0%     ~     (all equal)
LastIndexAnyASCII/16:2-48    36.0ns ± 0%   36.1ns ± 0%   +0.28%  (p=0.016 n=4+5)
LastIndexAnyASCII/16:4-48    40.3ns ± 2%   40.0ns ± 6%     ~     (p=0.651 n=5+5)
LastIndexAnyASCII/16:8-48    50.3ns ± 0%   50.2ns ± 9%     ~     (p=0.175 n=4+5)
LastIndexAnyASCII/16:16-48   62.4ns ± 4%   64.4ns ± 0%   +3.28%  (p=0.016 n=5+4)
LastIndexAnyASCII/16:32-48   98.9ns ± 0%   98.4ns ± 0%   -0.53%  (p=0.016 n=5+4)
LastIndexAnyASCII/16:64-48    160ns ± 1%    161ns ± 1%     ~     (p=0.325 n=5+5)
LastIndexAnyASCII/256:1-48    300ns ± 0%    301ns ± 0%   +0.33%  (p=0.008 n=5+5)
LastIndexAnyASCII/256:2-48    304ns ± 0%    304ns ± 0%     ~     (p=1.000 n=5+5)
LastIndexAnyASCII/256:4-48    311ns ± 0%    311ns ± 0%     ~     (p=0.556 n=4+5)
LastIndexAnyASCII/256:8-48    320ns ± 0%    321ns ± 0%     ~     (p=0.143 n=5+5)
LastIndexAnyASCII/256:16-48   333ns ± 0%    335ns ± 0%   +0.60%  (p=0.029 n=4+4)
LastIndexAnyASCII/256:32-48   367ns ± 0%    366ns ± 0%     ~     (p=0.095 n=4+5)
LastIndexAnyASCII/256:64-48   431ns ± 0%    424ns ± 0%   -1.62%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:1-48      19.7ns ± 1%   11.9ns ± 0%  -39.47%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:2-48      27.6ns ± 1%   11.9ns ± 0%  -56.82%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:4-48      29.9ns ± 0%   11.9ns ± 0%     ~     (p=0.079 n=4+5)
LastIndexAnyUTF8/1:8-48      48.7ns ± 0%   11.9ns ± 0%  -75.54%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:16-48     57.8ns ± 0%   11.4ns ± 0%  -80.26%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:32-48     94.7ns ± 0%   12.2ns ± 0%  -87.07%  (p=0.008 n=5+5)
LastIndexAnyUTF8/1:64-48      163ns ± 0%     13ns ± 1%  -91.93%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:1-48      258ns ± 0%     88ns ± 0%  -65.76%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:2-48      400ns ± 0%    162ns ± 0%  -59.38%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:4-48      415ns ± 0%    162ns ± 0%  -60.87%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:8-48      737ns ± 0%    162ns ± 0%  -78.02%  (p=0.000 n=5+4)
LastIndexAnyUTF8/16:16-48     882ns ± 0%    128ns ± 0%  -85.49%  (p=0.008 n=5+5)
LastIndexAnyUTF8/16:32-48    1.47µs ± 0%   0.16µs ± 0%  -89.29%  (p=0.000 n=4+5)
LastIndexAnyUTF8/16:64-48    2.56µs ± 0%   0.14µs ± 0%  -94.41%  (p=0.016 n=5+4)
LastIndexAnyUTF8/256:1-48    3.60µs ± 0%   1.23µs ± 0%  -65.67%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:2-48    5.78µs ± 0%   2.18µs ± 0%  -62.32%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:4-48    6.26µs ± 0%   2.18µs ± 0%  -65.15%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:8-48    11.2µs ± 0%    2.2µs ± 0%  -80.53%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:16-48   13.5µs ± 0%    1.9µs ± 0%  -86.02%  (p=0.016 n=4+5)
LastIndexAnyUTF8/256:32-48   23.0µs ± 0%    2.1µs ± 0%  -90.72%  (p=0.008 n=5+5)
LastIndexAnyUTF8/256:64-48   40.5µs ± 0%    2.1µs ± 0%  -94.73%  (p=0.008 n=5+5)

Change-Id: Ie05e306f8b184b989701868cb161ce8b3f18203b
Reviewed-on: https://go-review.googlesource.com/c/go/+/156998
Run-TryBot: eric fang <eric.fang@arm.com>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2020-03-11 05:13:13 +00:00