mirror of https://github.com/golang/go.git
ff14e08cd3
cmd/compile, math: improve implementation of math.{Max,Min} on loong64
Make math.{Min,Max} intrinsics and implement math.{archMax,archMin}
in hardware.
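A minimal sketch of the special cases any hardware min/max lowering has to preserve (these are the documented math.Min/math.Max semantics; nothing below is loong64-specific):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Any NaN operand must yield NaN, and signed zeros are ordered:
	// Max(+0, -0) = +0, Min(+0, -0) = -0.
	fmt.Println(math.Max(1, math.NaN()))               // NaN
	fmt.Println(math.Min(1, math.NaN()))               // NaN
	fmt.Println(math.Max(math.Inf(-1), 2))             // 2
	fmt.Println(1 / math.Min(0, math.Copysign(0, -1))) // -Inf, because Min picks -0
}
```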
goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
│ old.bench │ new.bench │
│ sec/op │ sec/op vs base │
Max 7.606n ± 0% 3.087n ± 0% -59.41% (p=0.000 n=20)
Min 7.205n ± 0% 2.904n ± 0% -59.69% (p=0.000 n=20)
MinFloat 37.220n ± 0% 4.802n ± 0% -87.10% (p=0.000 n=20)
MaxFloat 33.620n ± 0% 4.802n ± 0% -85.72% (p=0.000 n=20)
geomean 16.18n 3.792n -76.57%
goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3A5000 @ 2500.00MHz
│ old.bench │ new.bench │
│ sec/op │ sec/op vs base │
Max 10.010n ± 0% 7.196n ± 0% -28.11% (p=0.000 n=20)
Min 8.806n ± 0% 7.155n ± 0% -18.75% (p=0.000 n=20)
MinFloat 60.010n ± 0% 7.976n ± 0% -86.71% (p=0.000 n=20)
MaxFloat 56.410n ± 0% 7.980n ± 0% -85.85% (p=0.000 n=20)
geomean 23.37n 7.566n -67.63%
Updates #59120.
Change-Id: I6815d20bc304af3cbf5d6ca8fe0ca1c2ddebea2d
Reviewed-on: https://go-review.googlesource.com/c/go/+/580283
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
c18ff29295
cmd/compile: make sync/atomic AND/OR operations intrinsic on amd64
Update #61395
Change-Id: I59a950f48efc587dfdffce00e2f4f3ab99d8df00
Reviewed-on: https://go-review.googlesource.com/c/go/+/594738
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
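For reference, a sketch of the sync/atomic surface behind #61395 (the typed And/Or methods added in the Go 1.23 cycle; they return the old value, which is what makes the intrinsic's typing subtle):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var flags atomic.Uint32
	flags.Store(0b1010)

	// Or/And return the OLD value; the compiler must distinguish
	// "result used" from "result discarded" (see dbfa3cacc7 below).
	old := flags.Or(0b0001) // old == 0b1010, flags now 0b1011
	fmt.Printf("%04b %04b\n", old, flags.Load())

	old = flags.And(0b0011) // old == 0b1011, flags now 0b0011
	fmt.Printf("%04b %04b\n", old, flags.Load())
}
```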
dbfa3cacc7
cmd/compile: fix typing of atomic logical operations
For atomic AND and OR operations on memory, we currently have two views of the op. One just does the operation on the memory and returns just a memory. The other does the operation on the memory and returns the old value (before having the logical operation done to it) and memory.
These two type differently, and there's currently some confusion in our rules about which is which. Use different names for the two different flavors so we don't get them confused.
Update #61395
Change-Id: I07b4542db672b2cee98169ac42b67db73c482093
Reviewed-on: https://go-review.googlesource.com/c/go/+/594976
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
63c1e141bc
cmd/compile: intrinsify atomic And/Or on arm64
The atomic And/Or operators were added in CL 528797, but the
compiler does not intrinsify them; this CL does so for arm64.
Also, for the existing atomicAnd/Or operations, the updated
value is not used, but previously we needed a register to
temporarily hold it. Now that we have v.RegTmp, the new value
is not needed anymore. This CL changes that.
The other change is that the existing operations don't use their
result, but now we need the old value, not the new value, as
the result.
This CL also aliases all of the And/Or operations into the
sync/atomic package.
Performance on an ARMv8.1 machine:
old.txt new.txt
sec/op sec/op vs base
And32-160 8.716n ± 0% 4.771n ± 1% -45.26% (p=0.000 n=10)
And32Parallel-160 30.58n ± 2% 26.45n ± 4% -13.49% (p=0.000 n=10)
And64-160 8.750n ± 1% 4.754n ± 0% -45.67% (p=0.000 n=10)
And64Parallel-160 29.40n ± 3% 25.55n ± 5% -13.11% (p=0.000 n=10)
Or32-160 8.847n ± 1% 4.754n ± 1% -46.26% (p=0.000 n=10)
Or32Parallel-160 30.75n ± 3% 26.10n ± 4% -15.14% (p=0.000 n=10)
Or64-160 8.825n ± 1% 4.766n ± 0% -46.00% (p=0.000 n=10)
Or64Parallel-160 30.52n ± 5% 25.89n ± 6% -15.17% (p=0.000 n=10)
For #61395
Change-Id: Ib1d1ac83f7f67dcf67f74d003fadb0f80932b826
Reviewed-on: https://go-review.googlesource.com/c/go/+/584715
Auto-Submit: Austin Clements <austin@google.com>
TryBot-Bypass: Austin Clements <austin@google.com>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Fannie Zhang <Fannie.Zhang@arm.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
dca577d882
cmd/compile/internal/ssa: reintroduce ANDconst opcode on PPC64
This allows more effective conversion of rotate and mask opcodes into their CC equivalents, while simplifying the first lowering pass. This was removed before the latelower pass was introduced to fold more cases of compare against zero. Add ANDconst to push the conversion of ANDconst to ANDCCconst into latelower with the other CC opcodes. This also requires introducing RLDICLCC to prevent regressions when ANDconst is converted to RLDICL then to RLDICLCC and back to ANDCCconst when possible.
Change-Id: I9e5f9c99fbefa334db18c6c152c5f967f3ff2590
Reviewed-on: https://go-review.googlesource.com/c/go/+/586160
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
dfb17c126c
cmd/compile: support float min/max instructions on PPC64
This enables efficient use of the builtin min/max function for float64 and float32 types on GOPPC64 >= power9. Extend the assembler to support xsminjdp/xsmaxjdp and use them to implement float min/max. Simplify the VSX xx3 opcode rules to allow FPR arguments, if all arguments are an FPR.
Change-Id: I15882a4ce5dc46eba71d683cf1d184dc4236a328
Reviewed-on: https://go-review.googlesource.com/c/go/+/574535
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Than McIntosh <thanm@google.com>
c7065bb9db
cmd/compile/internal: generate ADDZE on PPC64
This usage shows up in quite a few places, and helps reduce register pressure in several complex crypto functions by removing a MOVD $0,... instruction.
Change-Id: I9444ea8f9d19bfd68fb71ea8dc34e109681b3802
Reviewed-on: https://go-review.googlesource.com/c/go/+/571055
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
997636760e
cmd/compile,cmd/internal/obj: provide rotation pseudo-instructions for riscv64
Provide and use rotation pseudo-instructions for riscv64. The RISC-V bitmanip extension adds support for hardware rotation instructions in the form of ROL, ROLW, ROR, RORI, RORIW and RORW. These are easily implemented in the assembler as pseudo-instructions for CPUs that do not support the bitmanip extension. This approach provides a number of advantages, including reducing the rewrite rules needed in the compiler, simplifying codegen tests and, most importantly, allowing these instructions to be used in assembly (for example, riscv64 optimised versions of SHA-256 and SHA-512). When bitmanip support is added, these instruction sequences can simply be replaced with a single instruction if permitted by the GORISCV64 profile.
Change-Id: Ia23402e1a82f211ac760690deb063386056ae1fa
Reviewed-on: https://go-review.googlesource.com/c/go/+/565015
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: M Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
daa58db486
cmd/compile: improve rotations for riscv64
Enable canRotate for riscv64, enable rotation intrinsics and provide
better rewrite implementations for rotations. By avoiding Lsh*x64
and Rsh*Ux64 we can produce better code, especially for 32 and 64
bit rotations. By enabling canRotate we also benefit from the generic
rotation rewrite rules.
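For reference, a sketch of the Go-level code these rules target (a hypothetical micro-example; both forms go through the rotate lowering once canRotate is enabled):

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	x := uint32(0xDEADBEEF)
	// Compiles to a rotate sequence (or a single rotate instruction
	// once the bitmanip extension is permitted by GORISCV64).
	fmt.Printf("%#x\n", bits.RotateLeft32(x, 8))
	// Hand-written rotate; the generic rewrite rules recognize it too.
	fmt.Printf("%#x\n", x<<8|x>>24)
}
```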
Benchmark on a StarFive VisionFive 2:
│ rotate.1 │ rotate.2 │
│ sec/op │ sec/op vs base │
RotateLeft-4 14.700n ± 0% 8.016n ± 0% -45.47% (p=0.000 n=10)
RotateLeft8-4 14.70n ± 0% 10.69n ± 0% -27.28% (p=0.000 n=10)
RotateLeft16-4 14.70n ± 0% 12.02n ± 0% -18.23% (p=0.000 n=10)
RotateLeft32-4 13.360n ± 0% 8.016n ± 0% -40.00% (p=0.000 n=10)
RotateLeft64-4 13.360n ± 0% 8.016n ± 0% -40.00% (p=0.000 n=10)
geomean 14.15n 9.208n -34.92%
Change-Id: I1a2036fdc57cf88ebb6617eb8d92e1d187e183b2
Reviewed-on: https://go-review.googlesource.com/c/go/+/560315
Reviewed-by: M Zhuo <mengzhuo1203@gmail.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
09ed9a6585
cmd/compile: implement float min/max in hardware for riscv64
CL 514596 adds float min/max for amd64; this CL adds it for riscv64.
The behavior of the RISC-V FMIN/FMAX instructions almost matches Go's
requirements.
However, according to RISC-V spec 8.3 "NaN Generation and Propagation"
>> if at least one input is a signaling NaN, or if both inputs are quiet
>> NaNs, the result is the canonical NaN. If one operand is a quiet NaN
>> and the other is not a NaN, the result is the non-NaN operand.
Go uses quiet NaN as its NaN, and according to the Go spec
>> if any argument is a NaN, the result is a NaN
This requires the float min/max implementation to check whether one
of the operands is a qNaN before the float min/max actually executes.
This CL also fixes a typo in the minmax test.
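A sketch of the required semantics in portable Go (an assumed shape of the check, not the actual SSA lowering):

```go
package main

import (
	"fmt"
	"math"
)

// minFloat64 mirrors what the riscv64 lowering must guarantee: FMIN.D
// alone would return the non-NaN operand for a quiet-NaN input, so a
// NaN check has to run before the hardware min.
func minFloat64(x, y float64) float64 {
	if math.IsNaN(x) || math.IsNaN(y) {
		return math.NaN() // Go spec: any NaN argument yields NaN
	}
	return math.Min(x, y) // stands in for the FMIN.D instruction
}

func main() {
	fmt.Println(minFloat64(1, math.NaN())) // NaN, not 1
	fmt.Println(minFloat64(1, 2))          // 1
}
```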
Benchmark on Visionfive2
goos: linux
goarch: riscv64
pkg: runtime
│ float_minmax.old.bench │ float_minmax.new.bench │
│ sec/op │ sec/op vs base │
MinFloat 158.20n ± 0% 28.13n ± 0% -82.22% (p=0.000 n=10)
MaxFloat 158.10n ± 0% 28.12n ± 0% -82.21% (p=0.000 n=10)
geomean 158.1n 28.12n -82.22%
Update #59488
Change-Id: Iab48be6d32b8882044fb8c821438ca8840e5493d
Reviewed-on: https://go-review.googlesource.com/c/go/+/514775
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Run-TryBot: M Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
6b77d1b736
cmd/compile: update loong64 CALL* ops
Allow the loong64 CALL* ops to take a variable number of args.
Update #40724
Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: I4706d9651fcbf9a0f201af6820c97b1a924f14e3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521781
Auto-Submit: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
ebca52eeb7
cmd/compile/internal: add register info for loong64 regABI
Update #40724
Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: Ifd7d94147b01e4fc83978b53dca2bcc0ad1ac4e3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521779
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
070139a130
cmd/compile,cmd/internal,runtime: change registers on loong64 to avoid regABI arguments
Update #40724
Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: Ic7e2e7fb4c1d3670e6abbfb817aa6e4e654e08d3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521777
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: David Chase <drchase@google.com>
f43581131e
cmd/compile, cmd/internal, runtime: change the registers used by the duff device for loong64
Add R21 to the allocatable registers, and use R20 and R21 in the duff device. This CL is in preparation for subsequent regABI support.
Updates #40724
Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: If1661adc0f766925fbe74827a369797f95fa28a9
Reviewed-on: https://go-review.googlesource.com/c/go/+/521775
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
773039ed5c
cmd/compile/internal/ssa: on PPC64, merge (CMPconst [0] (op ...)) more aggressively
Generate the CC version of many opcodes whose result is compared against signed 0. The approach taken here works even if the opcode result is used in multiple places too. Add support for ADD, ADDconst, ANDN, SUB, NEG, CNTLZD, NOR conversions to their CC opcode variant. These are the most commonly used variants. Also, do not set clobberFlags of CNTLZD and CNTLZW, they do not clobber flags. This results in about 1% smaller text sections in kubernetes binaries, and no regressions in the crypto benchmarks.
Change-Id: I9e0381944869c3774106bf348dead5ecb96dffda
Reviewed-on: https://go-review.googlesource.com/c/go/+/538636
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Jayanth Krishnamurthy <jayanth.krishnamurthy@ibm.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
962ccbef91
cmd/compile: ensure pointer arithmetic happens after the nil check
Have nil checks return a pointer that is known non-nil. Users of that pointer can use the result, ensuring that they are ordered after the nil check itself. The order dependence goes away after scheduling, when we've fixed an order. At that point we move uses back to the original pointer, so it doesn't change register allocation at all. This prevents pointer arithmetic on nil from being spilled to the stack and then observed by a stack scan.
Fixes #63657
Change-Id: I1a5fa4f2e6d9000d672792b4f90dfc1b7b67f6ea
Reviewed-on: https://go-review.googlesource.com/c/go/+/537775
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
8fc043ccfa
cmd/compile: optimize right shifts of int32 on riscv64
The compiler is currently sign extending 32 bit signed integers to
64 bits before right shifting them using a 64 bit shift instruction.
There's no need to do this as RISC-V has instructions for right
shifting 32 bit signed values (sraw and sraiw) which sign extend
the result of the shift to 64 bits. Change the compiler so that
it uses sraw and sraiw for shifts of signed 32 bit integers,
reducing in most cases the number of instructions needed to
perform the shift.
Here are some examples of code sequences that are changed by this
patch:
int32(a) >> 2
before:
sll x5,x10,0x20
sra x10,x5,0x22
after:
sraw x10,x10,0x2
int32(v) >> int(s)
before:
sext.w x5,x10
sltiu x6,x11,64
add x6,x6,-1
or x6,x11,x6
sra x10,x5,x6
after:
sltiu x5,x11,32
add x5,x5,-1
or x5,x11,x5
sraw x10,x10,x5
int32(v) >> (int(s) & 31)
before:
sext.w x5,x10
and x6,x11,63
sra x10,x5,x6
after:
and x5,x11,31
sraw x10,x10,x5
int32(100) >> int(a)
before:
bltz x10,<target address calls runtime.panicshift>
sltiu x5,x10,64
add x5,x5,-1
or x5,x10,x5
li x6,100
sra x10,x6,x5
after:
bltz x10,<target address calls runtime.panicshift>
sltiu x5,x10,32
add x5,x5,-1
or x5,x10,x5
li x6,100
sraw x10,x6,x5
int32(v) >> (int(s) & 63)
before:
sext.w x5,x10
and x6,x11,63
sra x10,x5,x6
after:
and x5,x11,63
sltiu x6,x5,32
add x6,x6,-1
or x5,x5,x6
sraw x10,x10,x5
In most cases we eliminate one instruction. In the case where
we shift an int32 constant by a variable, the number of instructions
generated is identical: a sra is simply replaced by a sraw. In the
unusual case where we shift right by a variable ANDed with a constant
> 31 but < 64, we generate two additional instructions. As this is
an unusual case, we do not try to optimize for it.
Some improvements can be seen in some of the existing benchmarks,
notably in the utf8 package, which performs right shifts of runes,
which are signed 32 bit integers.
| utf8-old | utf8-new |
| sec/op | sec/op vs base |
EncodeASCIIRune-4 17.68n ± 0% 17.67n ± 0% ~ (p=0.312 n=10)
EncodeJapaneseRune-4 35.34n ± 0% 34.53n ± 1% -2.31% (p=0.000 n=10)
AppendASCIIRune-4 3.213n ± 0% 3.213n ± 0% ~ (p=0.318 n=10)
AppendJapaneseRune-4 36.14n ± 0% 35.35n ± 0% -2.19% (p=0.000 n=10)
DecodeASCIIRune-4 28.11n ± 0% 27.36n ± 0% -2.69% (p=0.000 n=10)
DecodeJapaneseRune-4 38.55n ± 0% 38.58n ± 0% ~ (p=0.612 n=10)
Change-Id: I60a91cbede9ce65597571c7b7dd9943eeb8d3cc2
Reviewed-on: https://go-review.googlesource.com/c/go/+/535115
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: M Zhuo <mzh@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
3754ca0af2
cmd/compile: improve the implementation of Lowered{Move,Zero} on linux/loong64
As in CL 487295: when implementing Lowered{Move,Zero}, 8 is first subtracted
from Rarg0 (parameter Ptr), and then an offset of 8 is added during subsequent
operations on Rarg0. This operation is meaningless, so delete it.
Also change LoweredMove's Rarg0 register to R20, consistent with duffcopy.
goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3C5000 @ 2200.00MHz
│ old.bench │ new.bench │
│ sec/op │ sec/op vs base │
Memmove/15 19.10n ± 0% 19.10n ± 0% ~ (p=0.483 n=15)
MemmoveUnalignedDst/15 25.02n ± 0% 25.02n ± 0% ~ (p=0.741 n=15)
MemmoveUnalignedDst/32 48.22n ± 0% 48.22n ± 0% ~ (p=1.000 n=15) ¹
MemmoveUnalignedDst/64 90.57n ± 0% 90.52n ± 0% ~ (p=0.212 n=15)
MemmoveUnalignedDstOverlap/32 44.12n ± 0% 44.13n ± 0% +0.02% (p=0.000 n=15)
MemmoveUnalignedDstOverlap/64 87.79n ± 0% 87.80n ± 0% +0.01% (p=0.002 n=15)
MemmoveUnalignedSrc/0 3.639n ± 0% 3.639n ± 0% ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/1 7.733n ± 0% 7.733n ± 0% ~ (p=1.000 n=15)
MemmoveUnalignedSrc/2 9.097n ± 0% 9.097n ± 0% ~ (p=1.000 n=15)
MemmoveUnalignedSrc/3 10.46n ± 0% 10.46n ± 0% ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/4 11.83n ± 0% 11.83n ± 0% ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/64 93.71n ± 0% 93.70n ± 0% ~ (p=0.128 n=15)
Memclr/4096 699.1n ± 0% 699.1n ± 0% ~ (p=0.682 n=15)
Memclr/65536 11.18µ ± 0% 11.18µ ± 0% -0.01% (p=0.000 n=15)
Memclr/1M 175.2µ ± 0% 175.2µ ± 0% ~ (p=0.191 n=15)
Memclr/4M 661.8µ ± 0% 662.0µ ± 0% ~ (p=0.486 n=15)
MemclrUnaligned/4_5 19.39n ± 0% 20.47n ± 0% +5.57% (p=0.000 n=15)
MemclrUnaligned/4_16 22.29n ± 0% 21.38n ± 0% -4.08% (p=0.000 n=15)
MemclrUnaligned/4_64 30.58n ± 0% 29.81n ± 0% -2.52% (p=0.000 n=15)
MemclrUnaligned/4_65536 11.19µ ± 0% 11.20µ ± 0% +0.02% (p=0.000 n=15)
GoMemclr/5 12.73n ± 0% 12.73n ± 0% ~ (p=0.261 n=15)
GoMemclr/16 10.01n ± 0% 10.00n ± 0% ~ (p=0.264 n=15)
GoMemclr/256 50.94n ± 0% 50.94n ± 0% ~ (p=0.372 n=15)
ClearFat15 14.95n ± 0% 15.01n ± 4% ~ (p=0.925 n=15)
ClearFat1032 125.5n ± 0% 125.6n ± 0% +0.08% (p=0.000 n=15)
CopyFat64 10.58n ± 0% 10.01n ± 0% -5.39% (p=0.000 n=15)
CopyFat1040 244.3n ± 0% 155.6n ± 0% -36.31% (p=0.000 n=15)
Issue18740/2byte 29.82µ ± 0% 29.82µ ± 0% ~ (p=0.648 n=30)
Issue18740/4byte 18.18µ ± 0% 18.18µ ± 0% -0.02% (p=0.001 n=30)
Issue18740/8byte 8.395µ ± 0% 8.395µ ± 0% ~ (p=0.401 n=30)
geomean 154.5n 151.8n -1.70%
¹ all samples are equal
Change-Id: Ia3f3c8b25e1e93c97ab72328651de78ca9dec016
Reviewed-on: https://go-review.googlesource.com/c/go/+/488515
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
f711892a8a
cmd/compile/internal: stop lowering OpConvert on riscv64
Lowering for OpConvert was removed for all architectures in CL 108496, prior to the riscv64 port being upstreamed. Remove lowering of OpConvert on riscv64, which brings it in line with all other architectures. This results in 1,600+ instructions being removed from the riscv64 go binary.
Change-Id: Iaaf1f8b397875926604048b66ad8ac91a98c871e
Reviewed-on: https://go-review.googlesource.com/c/go/+/533335
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
561bf0457f
cmd/compile: optimize right shifts of uint32 on riscv
The compiler is currently zero extending 32 bit unsigned integers to
64 bits before right shifting them using a 64 bit shift instruction.
There's no need to do this as RISC-V has instructions for right
shifting 32 bit unsigned values (srlw and srliw) which zero extend
the result of the shift to 64 bits. Change the compiler so that
it uses srlw and srliw for 32 bit unsigned shifts, reducing in most
cases the number of instructions needed to perform the shift.
Here are some examples of code sequences that are changed by this
patch:
uint32(a) >> 2
before:
sll x5,x10,0x20
srl x10,x5,0x22
after:
srlw x10,x10,0x2
uint32(a) >> int(b)
before:
sll x5,x10,0x20
srl x5,x5,0x20
srl x5,x5,x11
sltiu x6,x11,64
neg x6,x6
and x10,x5,x6
after:
srlw x5,x10,x11
sltiu x6,x11,32
neg x6,x6
and x10,x5,x6
bits.RotateLeft32(uint32(a), 1)
before:
sll x5,x10,0x1
sll x6,x10,0x20
srl x7,x6,0x3f
or x5,x5,x7
after:
sll x5,x10,0x1
srlw x6,x10,0x1f
or x10,x5,x6
bits.RotateLeft32(uint32(a), int(b))
before:
and x6,x11,31
sll x7,x10,x6
sll x8,x10,0x20
srl x8,x8,0x20
add x6,x6,-32
neg x6,x6
srl x9,x8,x6
sltiu x6,x6,64
neg x6,x6
and x6,x9,x6
or x6,x6,x7
after:
and x5,x11,31
sll x6,x10,x5
add x5,x5,-32
neg x5,x5
srlw x7,x10,x5
sltiu x5,x5,32
neg x5,x5
and x5,x7,x5
or x10,x6,x5
The one regression observed is the following case: an unbounded right
shift of a uint32 where the value we're shifting by is known to be
< 64 but > 31. As this is an unusual case, this commit does not
optimize for it, although the existing code does.
uint32(a) >> (b & 63)
before:
sll x5,x10,0x20
srl x5,x5,0x20
and x6,x11,63
srl x10,x5,x6
after:
and x5,x11,63
srlw x6,x10,x5
sltiu x5,x5,32
neg x5,x5
and x10,x6,x5
Here we have one extra instruction.
Some benchmark highlights, generated on a VisionFive2 8GB running
Ubuntu 23.04.
pkg: math/bits
LeadingZeros32-4 18.64n ± 0% 17.32n ± 0% -7.11% (p=0.000 n=10)
LeadingZeros64-4 15.47n ± 0% 15.51n ± 0% +0.26% (p=0.027 n=10)
TrailingZeros16-4 18.48n ± 0% 17.68n ± 0% -4.33% (p=0.000 n=10)
TrailingZeros32-4 16.87n ± 0% 16.07n ± 0% -4.74% (p=0.000 n=10)
TrailingZeros64-4 15.26n ± 0% 15.27n ± 0% +0.07% (p=0.043 n=10)
OnesCount32-4 20.08n ± 0% 19.29n ± 0% -3.96% (p=0.000 n=10)
RotateLeft-4 8.864n ± 0% 8.838n ± 0% -0.30% (p=0.006 n=10)
RotateLeft32-4 8.837n ± 0% 8.032n ± 0% -9.11% (p=0.000 n=10)
Reverse32-4 29.77n ± 0% 26.52n ± 0% -10.93% (p=0.000 n=10)
ReverseBytes32-4 9.640n ± 0% 8.838n ± 0% -8.32% (p=0.000 n=10)
Sub32-4 8.835n ± 0% 8.035n ± 0% -9.06% (p=0.000 n=10)
geomean 11.50n 11.33n -1.45%
pkg: crypto/md5
Hash8Bytes-4 1.486µ ± 0% 1.426µ ± 0% -4.04% (p=0.000 n=10)
Hash64-4 2.079µ ± 0% 1.968µ ± 0% -5.36% (p=0.000 n=10)
Hash128-4 2.720µ ± 0% 2.557µ ± 0% -5.99% (p=0.000 n=10)
Hash256-4 3.996µ ± 0% 3.733µ ± 0% -6.58% (p=0.000 n=10)
Hash512-4 6.541µ ± 0% 6.072µ ± 0% -7.18% (p=0.000 n=10)
Hash1K-4 11.64µ ± 0% 10.75µ ± 0% -7.58% (p=0.000 n=10)
Hash8K-4 82.95µ ± 0% 76.32µ ± 0% -7.99% (p=0.000 n=10)
Hash1M-4 10.436m ± 0% 9.591m ± 0% -8.10% (p=0.000 n=10)
Hash8M-4 83.50m ± 0% 76.73m ± 0% -8.10% (p=0.000 n=10)
Hash8BytesUnaligned-4 1.494µ ± 0% 1.434µ ± 0% -4.02% (p=0.000 n=10)
Hash1KUnaligned-4 11.64µ ± 0% 10.76µ ± 0% -7.52% (p=0.000 n=10)
Hash8KUnaligned-4 83.01µ ± 0% 76.32µ ± 0% -8.07% (p=0.000 n=10)
geomean 28.32µ 26.42µ -6.72%
Change-Id: I20483a6668cca1b53fe83944bee3706aadcf8693
Reviewed-on: https://go-review.googlesource.com/c/go/+/528975
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
d98f74b31e
cmd/compile/internal: intrinsify publicationBarrier on riscv64
This enables publicationBarrier to be used as an intrinsic
on riscv64, eliminating the function call and return
instructions otherwise required to invoke the
"runtime.publicationBarrier" function.
This function is called by mallocgc. The benchmark results for malloc, tested on a Lichee Pi 4A (TH1520, RISC-V C910 x4 @ 2.0GHz), are as follows.
goos: linux
goarch: riscv64
pkg: runtime
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
Malloc8-4 92.78n ± 1% 90.77n ± 1% -2.17% (p=0.001 n=10)
Malloc16-4 156.5n ± 1% 151.7n ± 2% -3.10% (p=0.000 n=10)
MallocTypeInfo8-4 131.7n ± 1% 130.6n ± 2% ~ (p=0.165 n=10)
MallocTypeInfo16-4 186.5n ± 2% 186.2n ± 1% ~ (p=0.956 n=10)
MallocLargeStruct-4 1.345µ ± 1% 1.355µ ± 1% ~ (p=0.093 n=10)
geomean 216.9n 214.5n -1.10%
Change-Id: Ieab6c02309614bac5c1b12b5ee3311f988ff644d
Reviewed-on: https://go-review.googlesource.com/c/go/+/531719
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: M Zhuo <mzh@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
06f420fc19
runtime: remove the meaningless offset of 8 for duffzero on loong64
Currently we subtract 8 from the offset when calling duffzero because 8 is added to the offset in the duffzero implementation. This operation is meaningless, so remove it.
Change-Id: I7e451d04d7e98ccafe711645d81d3aadf376766f
Reviewed-on: https://go-review.googlesource.com/c/go/+/487295
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Run-TryBot: WANG Xuerui <git@xen0n.name>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
63ab68ddc5
cmd/compile: add single-precision FMA code generation for riscv64
This CL adds FMADDS, FMSUBS, FNMADDS and FNMSUBS SSA support for riscv64.
Change-Id: I1e7dd322b46b9e0f4923dbba256303d69ed12066
Reviewed-on: https://go-review.googlesource.com/c/go/+/506616
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: M Zhuo <mzh@golangcn.org>
05f9511582
cmd/compile: improve FP FMA performance on riscv64
FMADD/FMSUB/FNMSUB are efficient FP FMA instructions, which can be used by the compiler to improve FP performance.
Erf 188.0n ± 2% 139.5n ± 2% -25.82% (p=0.000 n=10)
Erfc 193.6n ± 1% 143.2n ± 1% -26.01% (p=0.000 n=10)
Erfinv 244.4n ± 2% 172.6n ± 0% -29.40% (p=0.000 n=10)
Erfcinv 244.7n ± 2% 173.0n ± 1% -29.31% (p=0.000 n=10)
geomean 216.0n 156.3n -27.65%
Ref: The RISC-V Instruction Set Manual Volume I: Unprivileged ISA, 11.6 Single-Precision Floating-Point Computational Instructions
Change-Id: I89aa3a4df7576fdd47f4a6ee608ac16feafd093c
Reviewed-on: https://go-review.googlesource.com/c/go/+/506036
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: M Zhuo <mzh@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
41c71d48a1
cmd/compile/internal: add RLDICR opcode for PPC64
This is encoded similarly to RLDICL, but can clear the least significant bits. Likewise, update the auxint encoding of RLDICL to match those used by the rotate and mask word ssa opcodes for easier usage within lowering rules. The RLDICL ssa opcode is not used yet.
Change-Id: I42486dd95714a3e8e2f19ab237a6cf3af520c905
Reviewed-on: https://go-review.googlesource.com/c/go/+/515575
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
611706b171
cmd/compile: don't use BTS when OR works, add direct memory BTS operations
Stop using BTSconst and friends when ORLconst can be used instead. OR can be issued by more function units than BTS can, so it could lead to better IPC. OR might take a few more bytes to encode, but not a lot more. Still use BTSconst for cases where the constant otherwise wouldn't fit and would require a separate movabs instruction to materialize the constant. This happens when setting bits 31-63 of 64-bit targets. Add BTS-to-memory operations so we don't need to load/bts/store.
Fixes #61694
Change-Id: I00379608df8fb0167cb01466e97d11dec7c1596c
Reviewed-on: https://go-review.googlesource.com/c/go/+/515755
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
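Roughly the distinction being drawn, as a sketch (the bit positions are hypothetical, chosen to straddle the 32-bit immediate range):

```go
package main

import "fmt"

func setBits(x uint64) uint64 {
	x |= 1 << 20 // fits a 32-bit immediate: a plain OR is preferred
	x |= 1 << 40 // bit 40: the OR constant would need a movabs, so BTSQ $40 wins
	return x
}

func main() {
	fmt.Printf("%#x\n", setBits(0)) // 0x10000100000
}
```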
319504ce43
cmd/compile: implement float min/max in hardware for amd64 and arm64
Update #59488
Change-Id: I89f5ea494cbcc887f6fae8560e57bcbd8749be86
Reviewed-on: https://go-review.googlesource.com/c/go/+/514596
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
67983c0f78
cmd/compile: add indexed SET* opcodes for amd64
Update #61356
Change-Id: I391af98563b1c068208784c80ea736c78c29639d
Reviewed-on: https://go-review.googlesource.com/c/go/+/510435
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Martin Möhrmann <martin@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
d9fd19a7f5
cmd/compile: optimize math.Float32bits and math.Float32frombits on mipsx
This CL uses MFC1/MTC1 instructions to move data between GPRs and FPRs, instead of using stores and loads to move float/int values.
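For reference, the functions being sped up are the math package's bit-cast helpers; a sketch of their use (each call is conceptually a single register-to-register move):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// With MTC1/MFC1 these no longer round-trip through the stack on mipsx.
	b := math.Float32bits(1.5)     // FPR -> GPR
	f := math.Float32frombits(b)   // GPR -> FPR
	fmt.Printf("%#08x %g\n", b, f) // 0x3fc00000 1.5
}
```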
goos: linux
goarch: mipsle
pkg: math
│ oldmathf │ newmathf │
│ sec/op │ sec/op vs base │
Acos-4 282.7n ± 0% 282.1n ± 0% -0.18% (p=0.010 n=8)
Acosh-4 450.8n ± 0% 450.9n ± 0% ~ (p=0.699 n=8)
Asin-4 272.6n ± 0% 272.1n ± 0% ~ (p=0.050 n=8)
Asinh-4 476.8n ± 0% 475.1n ± 0% -0.35% (p=0.018 n=8)
Atan-4 208.1n ± 0% 207.7n ± 0% -0.17% (p=0.009 n=8)
Atanh-4 448.8n ± 0% 448.7n ± 0% -0.03% (p=0.014 n=8)
Atan2-4 310.2n ± 0% 310.1n ± 0% ~ (p=0.133 n=8)
Cbrt-4 357.9n ± 0% 358.4n ± 0% +0.11% (p=0.014 n=8)
Ceil-4 203.8n ± 0% 204.7n ± 0% +0.42% (p=0.008 n=8)
Compare-4 21.12n ± 0% 22.09n ± 0% +4.59% (p=0.000 n=8)
Compare32-4 19.105n ± 0% 6.022n ± 0% -68.48% (p=0.000 n=8)
Copysign-4 33.17n ± 0% 33.15n ± 0% ~ (p=0.795 n=8)
Cos-4 385.2n ± 0% 384.8n ± 1% ~ (p=0.112 n=8)
Cosh-4 546.0n ± 0% 545.0n ± 0% -0.17% (p=0.012 n=8)
Erf-4 192.4n ± 0% 195.4n ± 1% +1.59% (p=0.000 n=8)
Erfc-4 187.8n ± 0% 192.7n ± 0% +2.64% (p=0.000 n=8)
Erfinv-4 221.8n ± 1% 219.8n ± 0% -0.88% (p=0.000 n=8)
Erfcinv-4 224.1n ± 1% 219.9n ± 0% -1.87% (p=0.000 n=8)
Exp-4 434.7n ± 0% 435.0n ± 0% ~ (p=0.339 n=8)
ExpGo-4 433.7n ± 0% 434.2n ± 0% +0.13% (p=0.005 n=8)
Expm1-4 243.0n ± 0% 242.9n ± 0% ~ (p=0.103 n=8)
Exp2-4 426.6n ± 0% 426.6n ± 0% ~ (p=0.822 n=8)
Exp2Go-4 425.6n ± 0% 425.5n ± 0% ~ (p=0.377 n=8)
Abs-4 8.033n ± 0% 8.029n ± 0% ~ (p=0.065 n=8)
Dim-4 18.07n ± 0% 18.07n ± 0% ~ (p=0.051 n=8)
Floor-4 151.6n ± 0% 151.6n ± 0% ~ (p=0.450 n=8)
Max-4 100.9n ± 8% 103.2n ± 2% ~ (p=0.099 n=8)
Min-4 116.4n ± 0% 116.4n ± 0% ~ (p=0.467 n=8)
Mod-4 959.6n ± 1% 950.9n ± 0% -0.91% (p=0.006 n=8)
Frexp-4 147.6n ± 0% 147.5n ± 0% -0.07% (p=0.026 n=8)
Gamma-4 482.7n ± 0% 478.2n ± 2% -0.92% (p=0.000 n=8)
Hypot-4 139.8n ± 1% 127.1n ± 8% -9.12% (p=0.000 n=8)
HypotGo-4 137.2n ± 7% 117.5n ± 2% -14.39% (p=0.001 n=8)
Ilogb-4 109.5n ± 0% 108.4n ± 1% -1.05% (p=0.001 n=8)
J0-4 1.304µ ± 0% 1.304µ ± 0% ~ (p=0.853 n=8)
J1-4 1.349µ ± 0% 1.331µ ± 0% -1.33% (p=0.000 n=8)
Jn-4 2.774µ ± 0% 2.750µ ± 0% -0.87% (p=0.000 n=8)
Ldexp-4 151.6n ± 0% 151.5n ± 0% ~ (p=0.695 n=8)
Lgamma-4 226.9n ± 0% 233.9n ± 0% +3.09% (p=0.000 n=8)
Log-4 407.6n ± 0% 407.4n ± 0% ~ (p=0.340 n=8)
Logb-4 121.5n ± 0% 121.5n ± 0% -0.08% (p=0.042 n=8)
Log1p-4 315.5n ± 0% 315.6n ± 0% ~ (p=0.930 n=8)
Log10-4 417.8n ± 0% 417.5n ± 0% ~ (p=0.053 n=8)
Log2-4 208.8n ± 0% 208.8n ± 0% ~ (p=0.582 n=8)
Modf-4 126.5n ± 0% 126.4n ± 0% ~ (p=0.128 n=8)
Nextafter32-4 112.45n ± 0% 82.27n ± 0% -26.84% (p=0.000 n=8)
Nextafter64-4 141.5n ± 0% 141.5n ± 0% ~ (p=0.569 n=8)
PowInt-4 754.0n ± 1% 754.6n ± 0% ~ (p=0.279 n=8)
PowFrac-4 1.608µ ± 1% 1.596µ ± 1% ~ (p=0.661 n=8)
Pow10Pos-4 18.07n ± 0% 18.07n ± 0% ~ (p=0.413 n=8)
Pow10Neg-4 17.08n ± 0% 18.07n ± 0% +5.80% (p=0.000 n=8)
Round-4 68.30n ± 0% 69.29n ± 0% +1.45% (p=0.000 n=8)
RoundToEven-4 78.33n ± 0% 78.34n ± 0% ~ (p=0.975 n=8)
Remainder-4 740.6n ± 1% 736.7n ± 0% ~ (p=0.098 n=8)
Signbit-4 18.08n ± 0% 18.07n ± 0% ~ (p=0.546 n=8)
Sin-4 389.4n ± 0% 389.5n ± 0% ~ (p=0.451 n=8)
Sincos-4 415.6n ± 0% 415.6n ± 0% ~ (p=0.450 n=8)
Sinh-4 607.0n ± 0% 590.8n ± 1% -2.68% (p=0.000 n=8)
SqrtIndirect-4 8.034n ± 0% 8.030n ± 0% ~ (p=0.487 n=8)
SqrtLatency-4 8.031n ± 0% 8.034n ± 0% ~ (p=0.152 n=8)
SqrtIndirectLatency-4 8.032n ± 0% 8.032n ± 0% ~ (p=0.818 n=8)
SqrtGoLatency-4 895.8n ± 0% 895.3n ± 0% ~ (p=0.553 n=8)
SqrtPrime-4 5.405µ ± 0% 5.379µ ± 0% -0.48% (p=0.000 n=8)
Tan-4 405.6n ± 0% 405.7n ± 0% ~ (p=0.980 n=8)
Tanh-4 545.1n ± 0% 545.1n ± 0% ~ (p=0.806 n=8)
Trunc-4 146.5n ± 0% 146.6n ± 0% ~ (p=0.380 n=8)
Y0-4 1.308µ ± 0% 1.306µ ± 0% ~ (p=0.071 n=8)
Y1-4 1.311µ ± 0% 1.315µ ± 0% +0.31% (p=0.000 n=8)
Yn-4 2.737µ ± 0% 2.745µ ± 0% +0.27% (p=0.000 n=8)
Float64bits-4 14.56n ± 0% 14.56n ± 0% ~ (p=0.689 n=8)
Float64frombits-4 19.08n ± 0% 19.08n ± 0% ~ (p=0.580 n=8)
Float32bits-4 13.050n ± 0% 5.019n ± 0% -61.54% (p=0.000 n=8)
Float32frombits-4 13.060n ± 0% 4.016n ± 0% -69.25% (p=0.000 n=8)
FMA-4 608.5n ± 0% 586.1n ± 0% -3.67% (p=0.000 n=8)
geomean 185.5n 176.2n -5.02%
Change-Id: Ibf91092ffe70104e6c5ec03bc76d51259818b9b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/494535
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
f0d575c266
cmd/compile: optimize math.Float64(32)bits and math.Float64(32)frombits on mips64x
This CL uses MFC1/MTC1 instructions to move data between GPRs and FPRs, instead of using stores and loads to move float/int values.
goos: linux
goarch: mips64le
pkg: math
│ oldmath │ newmath │
│ sec/op │ sec/op vs base │
Acos-4 258.2n ± 0% 258.2n ± 0% ~ (p=0.859 n=8)
Acosh-4 378.7n ± 0% 323.9n ± 0% -14.47% (p=0.000 n=8)
Asin-4 255.1n ± 2% 255.5n ± 0% +0.16% (p=0.002 n=8)
Asinh-4 407.1n ± 0% 348.7n ± 0% -14.35% (p=0.000 n=8)
Atan-4 189.5n ± 0% 189.9n ± 3% ~ (p=0.205 n=8)
Atanh-4 355.6n ± 0% 323.4n ± 2% -9.03% (p=0.000 n=8)
Atan2-4 284.1n ± 7% 280.1n ± 4% ~ (p=0.313 n=8)
Cbrt-4 314.3n ± 0% 236.4n ± 0% -24.79% (p=0.000 n=8)
Ceil-4 144.3n ± 3% 139.6n ± 0% ~ (p=0.069 n=8)
Compare-4 21.100n ± 0% 7.035n ± 0% -66.66% (p=0.000 n=8)
Compare32-4 20.100n ± 0% 6.030n ± 0% -70.00% (p=0.000 n=8)
Copysign-4 34.970n ± 0% 6.221n ± 0% -82.21% (p=0.000 n=8)
Cos-4 183.4n ± 3% 184.1n ± 5% ~ (p=0.159 n=8)
Cosh-4 487.9n ± 2% 419.6n ± 0% -14.00% (p=0.000 n=8)
Erf-4 160.6n ± 0% 157.9n ± 0% -1.68% (p=0.009 n=8)
Erfc-4 183.7n ± 4% 169.8n ± 0% -7.54% (p=0.000 n=8)
Erfinv-4 191.5n ± 4% 183.6n ± 0% -4.13% (p=0.023 n=8)
Erfcinv-4 192.0n ± 7% 184.3n ± 0% ~ (p=0.425 n=8)
Exp-4 398.2n ± 0% 340.1n ± 4% -14.58% (p=0.000 n=8)
ExpGo-4 383.3n ± 0% 327.3n ± 0% -14.62% (p=0.000 n=8)
Expm1-4 248.7n ± 5% 216.0n ± 0% -13.11% (p=0.000 n=8)
Exp2-4 372.8n ± 0% 316.9n ± 3% -14.98% (p=0.000 n=8)
Exp2Go-4 374.1n ± 0% 320.5n ± 0% -14.33% (p=0.000 n=8)
Abs-4 3.013n ± 0% 3.016n ± 0% +0.10% (p=0.020 n=8)
Dim-4 5.021n ± 0% 5.022n ± 0% ~ (p=0.270 n=8)
Floor-4 127.5n ± 4% 126.2n ± 3% ~ (p=0.186 n=8)
Max-4 72.32n ± 0% 61.33n ± 0% -15.20% (p=0.000 n=8)
Min-4 83.33n ± 1% 61.36n ± 0% -26.37% (p=0.000 n=8)
Mod-4 690.7n ± 0% 454.5n ± 0% -34.20% (p=0.000 n=8)
Frexp-4 116.30n ± 1% 71.80n ± 1% -38.26% (p=0.000 n=8)
Gamma-4 389.0n ± 0% 355.9n ± 1% -8.48% (p=0.000 n=8)
Hypot-4 102.40n ± 0% 83.90n ± 0% -18.07% (p=0.000 n=8)
HypotGo-4 105.45n ± 4% 84.82n ± 2% -19.56% (p=0.000 n=8)
Ilogb-4 99.13n ± 4% 63.71n ± 2% -35.73% (p=0.000 n=8)
J0-4 859.7n ± 0% 854.8n ± 0% -0.57% (p=0.000 n=8)
J1-4 873.9n ± 0% 875.7n ± 0% +0.21% (p=0.007 n=8)
Jn-4 1.855µ ± 0% 1.867µ ± 0% +0.65% (p=0.000 n=8)
Ldexp-4 130.50n ± 2% 64.35n ± 0% -50.69% (p=0.000 n=8)
Lgamma-4 208.8n ± 0% 200.9n ± 0% -3.78% (p=0.000 n=8)
Log-4 294.1n ± 0% 255.2n ± 3% -13.22% (p=0.000 n=8)
Logb-4 105.45n ± 1% 66.81n ± 1% -36.64% (p=0.000 n=8)
Log1p-4 268.2n ± 0% 211.3n ± 0% -21.21% (p=0.000 n=8)
Log10-4 295.4n ± 0% 255.2n ± 2% -13.59% (p=0.000 n=8)
Log2-4 152.9n ± 1% 127.5n ± 0% -16.61% (p=0.000 n=8)
Modf-4 103.40n ± 0% 75.36n ± 0% -27.12% (p=0.000 n=8)
Nextafter32-4 121.20n ± 1% 78.40n ± 0% -35.31% (p=0.000 n=8)
Nextafter64-4 110.40n ± 1% 64.91n ± 0% -41.20% (p=0.000 n=8)
PowInt-4 509.8n ± 1% 369.3n ± 1% -27.56% (p=0.000 n=8)
PowFrac-4 1189.0n ± 0% 947.8n ± 0% -20.29% (p=0.000 n=8)
Pow10Pos-4 15.07n ± 0% 15.07n ± 0% ~ (p=0.733 n=8)
Pow10Neg-4 20.10n ± 0% 20.10n ± 0% ~ (p=0.576 n=8)
Round-4 44.22n ± 0% 26.12n ± 0% -40.92% (p=0.000 n=8)
RoundToEven-4 46.22n ± 0% 27.12n ± 0% -41.31% (p=0.000 n=8)
Remainder-4 539.0n ± 1% 417.1n ± 1% -22.62% (p=0.000 n=8)
Signbit-4 17.985n ± 0% 5.694n ± 0% -68.34% (p=0.000 n=8)
Sin-4 185.7n ± 5% 172.9n ± 0% -6.89% (p=0.001 n=8)
Sincos-4 176.6n ± 0% 200.9n ± 0% +13.76% (p=0.000 n=8)
Sinh-4 495.8n ± 0% 435.9n ± 0% -12.09% (p=0.000 n=8)
SqrtIndirect-4 5.022n ± 0% 5.024n ± 0% ~ (p=0.083 n=8)
SqrtLatency-4 8.038n ± 0% 8.044n ± 0% ~ (p=0.524 n=8)
SqrtIndirectLatency-4 8.035n ± 0% 8.039n ± 0% +0.06% (p=0.017 n=8)
SqrtGoLatency-4 340.1n ± 0% 278.3n ± 0% -18.19% (p=0.000 n=8)
SqrtPrime-4 5.381µ ± 0% 5.386µ ± 0% ~ (p=0.662 n=8)
Tan-4 198.6n ± 1% 183.1n ± 0% -7.85% (p=0.000 n=8)
Tanh-4 491.3n ± 1% 440.8n ± 1% -10.29% (p=0.000 n=8)
Trunc-4 121.7n ± 0% 121.7n ± 0% ~ (p=0.769 n=8)
Y0-4 855.1n ± 0% 859.8n ± 0% +0.54% (p=0.007 n=8)
Y1-4 862.3n ± 0% 865.1n ± 0% +0.32% (p=0.007 n=8)
Yn-4 1.830µ ± 0% 1.837µ ± 0% +0.36% (p=0.011 n=8)
Float64bits-4 13.060n ± 0% 3.016n ± 0% -76.91% (p=0.000 n=8)
Float64frombits-4 13.060n ± 0% 3.018n ± 0% -76.90% (p=0.000 n=8)
Float32bits-4 13.060n ± 0% 3.016n ± 0% -76.91% (p=0.000 n=8)
Float32frombits-4 13.070n ± 0% 3.013n ± 0% -76.94% (p=0.000 n=8)
FMA-4 446.0n ± 0% 413.1n ± 1% -7.38% (p=0.000 n=8)
geomean 143.4n 108.3n -24.49%
Change-Id: I2067f7a5ae1126ada7ab3fb2083710e8212535e9
Reviewed-on: https://go-review.googlesource.com/c/go/+/493815
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
75add1ce0e
cmd/compile: intrinsify runtime/internal/atomic.{And,Or} on MIPS64x
This CL intrinsifies atomic.{And,Or} on mips64x, which are already intrinsified on mipsx.
goos: linux
goarch: mips64le
pkg: runtime/internal/atomic
│ oldatomic │ newatomic │
│ sec/op │ sec/op vs base │
AtomicLoad64-4 27.96n ± 0% 28.02n ± 0% +0.20% (p=0.026 n=8)
AtomicStore64-4 29.14n ± 0% 29.21n ± 0% +0.22% (p=0.004 n=8)
AtomicLoad-4 27.96n ± 0% 28.02n ± 0% ~ (p=0.220 n=8)
AtomicStore-4 29.15n ± 0% 29.21n ± 0% +0.19% (p=0.002 n=8)
And8-4 53.09n ± 0% 41.71n ± 0% -21.44% (p=0.000 n=8)
And-4 49.87n ± 0% 39.93n ± 0% -19.93% (p=0.000 n=8)
And8Parallel-4 70.45n ± 0% 68.58n ± 0% -2.65% (p=0.000 n=8)
AndParallel-4 70.40n ± 0% 67.95n ± 0% -3.47% (p=0.000 n=8)
Or8-4 52.09n ± 0% 41.11n ± 0% -21.08% (p=0.000 n=8)
Or-4 49.80n ± 0% 39.87n ± 0% -19.93% (p=0.000 n=8)
Or8Parallel-4 70.43n ± 0% 68.25n ± 0% -3.08% (p=0.000 n=8)
OrParallel-4 70.42n ± 0% 67.94n ± 0% -3.51% (p=0.000 n=8)
Xadd-4 67.83n ± 0% 67.92n ± 0% +0.13% (p=0.003 n=8)
Xadd64-4 67.85n ± 0% 67.92n ± 0% +0.09% (p=0.021 n=8)
Cas-4 81.34n ± 0% 81.37n ± 0% ~ (p=0.859 n=8)
Cas64-4 81.43n ± 0% 81.53n ± 0% +0.13% (p=0.001 n=8)
Xchg-4 67.15n ± 0% 67.18n ± 0% ~ (p=0.367 n=8)
Xchg64-4 67.16n ± 0% 67.21n ± 0% +0.08% (p=0.008 n=8)
geomean 54.04n 51.01n -5.61%
Change-Id: I9a4353f4b14134f1e9cf0dcf99db3feb951328ed
Reviewed-on: https://go-review.googlesource.com/c/go/+/494875
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Junxian Zhu <zhujunxian@oss.cipunited.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
5cad8d41ca
math: optimize math.Abs on mipsx
This commit optimizes the math.Abs implementation on mipsx.
Tested on a Loongson 3A2000.
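A sketch of the portable shape of math.Abs (clearing the IEEE 754 sign bit); based on the commit text, the mipsx win comes from doing this without the float-to-integer register moves:

```go
package main

import (
	"fmt"
	"math"
)

// abs mirrors the portable math.Abs implementation: clear the sign bit
// of the float64 bit pattern.
func abs(x float64) float64 {
	return math.Float64frombits(math.Float64bits(x) &^ (1 << 63))
}

func main() {
	fmt.Println(abs(-2.5), abs(2.5)) // 2.5 2.5
	fmt.Println(abs(math.Inf(-1)))   // +Inf
}
```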
goos: linux
goarch: mipsle
pkg: math
│ oldmath │ newmath │
│ sec/op │ sec/op vs base │
Acos-4 282.6n ± 0% 282.3n ± 0% ~ (p=0.140 n=7)
Acosh-4 506.1n ± 0% 451.8n ± 0% -10.73% (p=0.001 n=7)
Asin-4 272.3n ± 0% 272.2n ± 0% ~ (p=0.808 n=7)
Asinh-4 529.7n ± 0% 475.3n ± 0% -10.27% (p=0.001 n=7)
Atan-4 208.2n ± 0% 207.9n ± 0% ~ (p=0.134 n=7)
Atanh-4 503.4n ± 1% 449.7n ± 0% -10.67% (p=0.001 n=7)
Atan2-4 310.5n ± 0% 310.5n ± 0% ~ (p=0.928 n=7)
Cbrt-4 359.3n ± 0% 358.8n ± 0% ~ (p=0.121 n=7)
Ceil-4 203.9n ± 0% 204.0n ± 0% ~ (p=0.600 n=7)
Compare-4 23.11n ± 0% 23.11n ± 0% ~ (p=0.702 n=7)
Compare32-4 19.09n ± 0% 19.12n ± 0% ~ (p=0.070 n=7)
Copysign-4 33.20n ± 0% 34.02n ± 0% +2.47% (p=0.001 n=7)
Cos-4 422.5n ± 0% 385.4n ± 1% -8.78% (p=0.001 n=7)
Cosh-4 628.0n ± 0% 545.5n ± 0% -13.14% (p=0.001 n=7)
Erf-4 193.7n ± 2% 192.7n ± 1% ~ (p=0.430 n=7)
Erfc-4 192.8n ± 1% 193.0n ± 0% ~ (p=0.245 n=7)
Erfinv-4 220.7n ± 1% 221.5n ± 2% ~ (p=0.272 n=7)
Erfcinv-4 221.3n ± 1% 220.4n ± 2% ~ (p=0.738 n=7)
Exp-4 471.4n ± 0% 435.1n ± 0% -7.70% (p=0.001 n=7)
ExpGo-4 470.6n ± 0% 434.0n ± 0% -7.78% (p=0.001 n=7)
Expm1-4 243.1n ± 0% 243.4n ± 0% ~ (p=0.417 n=7)
Exp2-4 463.1n ± 0% 427.0n ± 0% -7.80% (p=0.001 n=7)
Exp2Go-4 462.4n ± 0% 426.2n ± 5% -7.83% (p=0.001 n=7)
Abs-4 37.000n ± 0% 8.039n ± 9% -78.27% (p=0.001 n=7)
Dim-4 18.09n ± 0% 18.11n ± 0% ~ (p=0.094 n=7)
Floor-4 151.9n ± 0% 151.8n ± 0% ~ (p=0.190 n=7)
Max-4 116.7n ± 1% 116.7n ± 1% ~ (p=0.842 n=7)
Min-4 116.6n ± 1% 116.6n ± 0% ~ (p=0.464 n=7)
Mod-4 1244.0n ± 0% 980.9n ± 0% -21.15% (p=0.001 n=7)
Frexp-4 199.0n ± 0% 146.7n ± 0% -26.28% (p=0.001 n=7)
Gamma-4 516.4n ± 0% 479.3n ± 1% -7.18% (p=0.001 n=7)
Hypot-4 169.8n ± 0% 117.8n ± 2% -30.62% (p=0.001 n=7)
HypotGo-4 170.8n ± 0% 117.5n ± 0% -31.21% (p=0.001 n=7)
Ilogb-4 160.8n ± 0% 109.5n ± 0% -31.90% (p=0.001 n=7)
J0-4 1.359µ ± 0% 1.305µ ± 0% -3.97% (p=0.001 n=7)
J1-4 1.386µ ± 0% 1.334µ ± 0% -3.75% (p=0.001 n=7)
Jn-4 2.864µ ± 0% 2.758µ ± 0% -3.70% (p=0.001 n=7)
Ldexp-4 202.9n ± 0% 151.7n ± 0% -25.23% (p=0.001 n=7)
Lgamma-4 234.0n ± 0% 234.3n ± 0% ~ (p=0.199 n=7)
Log-4 444.1n ± 0% 407.9n ± 0% -8.15% (p=0.001 n=7)
Logb-4 157.8n ± 0% 121.6n ± 0% -22.94% (p=0.001 n=7)
Log1p-4 354.8n ± 0% 315.4n ± 0% -11.10% (p=0.001 n=7)
Log10-4 453.9n ± 0% 417.9n ± 0% -7.93% (p=0.001 n=7)
Log2-4 245.3n ± 0% 209.1n ± 0% -14.76% (p=0.001 n=7)
Modf-4 126.6n ± 0% 126.6n ± 0% ~ (p=0.126 n=7)
Nextafter32-4 112.5n ± 0% 112.5n ± 0% ~ (p=0.853 n=7)
Nextafter64-4 141.7n ± 0% 141.6n ± 0% ~ (p=0.331 n=7)
PowInt-4 878.8n ± 1% 758.3n ± 1% -13.71% (p=0.001 n=7)
PowFrac-4 1.809µ ± 0% 1.615µ ± 0% -10.72% (p=0.001 n=7)
Pow10Pos-4 18.10n ± 0% 18.12n ± 0% ~ (p=0.464 n=7)
Pow10Neg-4 17.09n ± 0% 17.09n ± 0% ~ (p=0.263 n=7)
Round-4 68.36n ± 0% 68.33n ± 0% ~ (p=0.325 n=7)
RoundToEven-4 78.40n ± 0% 78.40n ± 0% ~ (p=0.934 n=7)
Remainder-4 894.0n ± 1% 753.4n ± 1% -15.73% (p=0.001 n=7)
Signbit-4 18.09n ± 0% 18.09n ± 0% ~ (p=0.761 n=7)
Sin-4 389.8n ± 1% 389.8n ± 0% ~ (p=0.995 n=7)
Sincos-4 416.0n ± 0% 415.9n ± 0% ~ (p=0.361 n=7)
Sinh-4 634.6n ± 4% 585.6n ± 1% -7.72% (p=0.001 n=7)
SqrtIndirect-4 8.035n ± 0% 8.036n ± 0% ~ (p=0.523 n=7)
SqrtLatency-4 8.039n ± 0% 8.037n ± 0% ~ (p=0.218 n=7)
SqrtIndirectLatency-4 8.040n ± 0% 8.040n ± 0% ~ (p=0.652 n=7)
SqrtGoLatency-4 895.7n ± 0% 896.6n ± 0% +0.10% (p=0.004 n=7)
SqrtPrime-4 5.406µ ± 0% 5.407µ ± 0% ~ (p=0.592 n=7)
Tan-4 406.1n ± 0% 405.8n ± 1% ~ (p=0.435 n=7)
Tanh-4 627.6n ± 0% 545.5n ± 0% -13.08% (p=0.001 n=7)
Trunc-4 146.7n ± 1% 146.7n ± 0% ~ (p=0.755 n=7)
Y0-4 1.359µ ± 0% 1.310µ ± 0% -3.61% (p=0.001 n=7)
Y1-4 1.351µ ± 0% 1.301µ ± 0% -3.70% (p=0.001 n=7)
Yn-4 2.829µ ± 0% 2.729µ ± 0% -3.53% (p=0.001 n=7)
Float64bits-4 14.08n ± 0% 14.07n ± 0% ~ (p=0.069 n=7)
Float64frombits-4 19.09n ± 0% 19.10n ± 0% ~ (p=0.755 n=7)
Float32bits-4 13.06n ± 0% 13.07n ± 1% ~ (p=0.586 n=7)
Float32frombits-4 13.06n ± 0% 13.06n ± 0% ~ (p=0.853 n=7)
FMA-4 606.9n ± 0% 606.8n ± 0% ~ (p=0.393 n=7)
geomean 201.1n 185.4n -7.81%
Change-Id: I6d41a97ad3789ed5731588588859ac0b8b13b664
Reviewed-on: https://go-review.googlesource.com/c/go/+/484675
Reviewed-by: Rong Zhang <rongrong@oss.cipunited.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
574431cfcd
math: optimize math.Abs on mips64x
This commit optimizes the math.Abs implementation on mips64x.
Tested on a Loongson 3A2000.
goos: linux
goarch: mips64le
pkg: math
│ oldmath │ newmath │
│ sec/op │ sec/op vs base │
Acos-4 258.0n ± ∞ ¹ 257.1n ± ∞ ¹ -0.35% (p=0.008 n=5)
Acosh-4 417.0n ± ∞ ¹ 377.9n ± ∞ ¹ -9.38% (p=0.008 n=5)
Asin-4 248.0n ± ∞ ¹ 259.9n ± ∞ ¹ +4.80% (p=0.008 n=5)
Asinh-4 439.6n ± ∞ ¹ 408.3n ± ∞ ¹ -7.12% (p=0.008 n=5)
Atan-4 189.6n ± ∞ ¹ 188.8n ± ∞ ¹ ~ (p=0.056 n=5)
Atanh-4 390.0n ± ∞ ¹ 356.4n ± ∞ ¹ -8.62% (p=0.008 n=5)
Atan2-4 279.0n ± ∞ ¹ 263.9n ± ∞ ¹ -5.41% (p=0.008 n=5)
Cbrt-4 314.2n ± ∞ ¹ 322.3n ± ∞ ¹ +2.58% (p=0.008 n=5)
Ceil-4 139.7n ± ∞ ¹ 136.6n ± ∞ ¹ -2.22% (p=0.008 n=5)
Compare-4 21.11n ± ∞ ¹ 21.09n ± ∞ ¹ ~ (p=0.405 n=5)
Compare32-4 20.10n ± ∞ ¹ 20.12n ± ∞ ¹ ~ (p=0.206 n=5)
Copysign-4 32.17n ± ∞ ¹ 35.71n ± ∞ ¹ +11.00% (p=0.008 n=5)
Cos-4 222.8n ± ∞ ¹ 169.8n ± ∞ ¹ -23.79% (p=0.008 n=5)
Cosh-4 550.2n ± ∞ ¹ 477.4n ± ∞ ¹ -13.23% (p=0.008 n=5)
Erf-4 171.6n ± ∞ ¹ 174.5n ± ∞ ¹ ~ (p=0.635 n=5)
Erfc-4 182.6n ± ∞ ¹ 170.2n ± ∞ ¹ -6.79% (p=0.008 n=5)
Erfinv-4 177.6n ± ∞ ¹ 196.6n ± ∞ ¹ +10.70% (p=0.008 n=5)
Erfcinv-4 177.8n ± ∞ ¹ 197.8n ± ∞ ¹ +11.25% (p=0.008 n=5)
Exp-4 422.8n ± ∞ ¹ 382.1n ± ∞ ¹ -9.63% (p=0.008 n=5)
ExpGo-4 416.1n ± ∞ ¹ 383.2n ± ∞ ¹ -7.91% (p=0.008 n=5)
Expm1-4 232.9n ± ∞ ¹ 252.2n ± ∞ ¹ +8.29% (p=0.008 n=5)
Exp2-4 404.8n ± ∞ ¹ 389.1n ± ∞ ¹ -3.88% (p=0.008 n=5)
Exp2Go-4 407.0n ± ∞ ¹ 372.3n ± ∞ ¹ -8.53% (p=0.008 n=5)
Abs-4 30.120n ± ∞ ¹ 3.014n ± ∞ ¹ -89.99% (p=0.008 n=5)
Dim-4 5.021n ± ∞ ¹ 5.023n ± ∞ ¹ ~ (p=0.071 n=5)
Floor-4 127.8n ± ∞ ¹ 127.1n ± ∞ ¹ -0.55% (p=0.008 n=5)
Max-4 77.69n ± ∞ ¹ 76.33n ± ∞ ¹ -1.75% (p=0.008 n=5)
Min-4 83.27n ± ∞ ¹ 77.87n ± ∞ ¹ -6.48% (p=0.008 n=5)
Mod-4 906.2n ± ∞ ¹ 692.9n ± ∞ ¹ -23.54% (p=0.008 n=5)
Frexp-4 150.6n ± ∞ ¹ 108.6n ± ∞ ¹ -27.89% (p=0.008 n=5)
Gamma-4 418.4n ± ∞ ¹ 386.1n ± ∞ ¹ -7.72% (p=0.008 n=5)
Hypot-4 148.20n ± ∞ ¹ 93.78n ± ∞ ¹ -36.72% (p=0.008 n=5)
HypotGo-4 148.20n ± ∞ ¹ 94.47n ± ∞ ¹ -36.26% (p=0.008 n=5)
Ilogb-4 135.50n ± ∞ ¹ 92.38n ± ∞ ¹ -31.82% (p=0.008 n=5)
J0-4 937.7n ± ∞ ¹ 861.7n ± ∞ ¹ -8.10% (p=0.008 n=5)
J1-4 915.4n ± ∞ ¹ 875.9n ± ∞ ¹ -4.32% (p=0.008 n=5)
Jn-4 1.974µ ± ∞ ¹ 1.863µ ± ∞ ¹ -5.62% (p=0.008 n=5)
Ldexp-4 158.5n ± ∞ ¹ 129.3n ± ∞ ¹ -18.42% (p=0.008 n=5)
Lgamma-4 209.0n ± ∞ ¹ 211.8n ± ∞ ¹ ~ (p=0.095 n=5)
Log-4 326.4n ± ∞ ¹ 295.2n ± ∞ ¹ -9.56% (p=0.008 n=5)
Logb-4 147.7n ± ∞ ¹ 105.0n ± ∞ ¹ -28.91% (p=0.008 n=5)
Log1p-4 303.4n ± ∞ ¹ 266.3n ± ∞ ¹ -12.23% (p=0.008 n=5)
Log10-4 329.2n ± ∞ ¹ 298.3n ± ∞ ¹ -9.39% (p=0.008 n=5)
Log2-4 187.4n ± ∞ ¹ 153.0n ± ∞ ¹ -18.36% (p=0.008 n=5)
Modf-4 110.5n ± ∞ ¹ 103.5n ± ∞ ¹ -6.33% (p=0.008 n=5)
Nextafter32-4 128.4n ± ∞ ¹ 121.5n ± ∞ ¹ -5.37% (p=0.016 n=5)
Nextafter64-4 109.5n ± ∞ ¹ 110.5n ± ∞ ¹ +0.91% (p=0.008 n=5)
PowInt-4 603.3n ± ∞ ¹ 516.4n ± ∞ ¹ -14.40% (p=0.008 n=5)
PowFrac-4 1.365µ ± ∞ ¹ 1.183µ ± ∞ ¹ -13.33% (p=0.008 n=5)
Pow10Pos-4 15.07n ± ∞ ¹ 15.07n ± ∞ ¹ ~ (p=0.738 n=5)
Pow10Neg-4 21.11n ± ∞ ¹ 21.10n ± ∞ ¹ ~ (p=0.190 n=5)
Round-4 44.23n ± ∞ ¹ 44.22n ± ∞ ¹ ~ (p=0.635 n=5)
RoundToEven-4 50.25n ± ∞ ¹ 46.27n ± ∞ ¹ -7.92% (p=0.008 n=5)
Remainder-4 675.6n ± ∞ ¹ 530.4n ± ∞ ¹ -21.49% (p=0.008 n=5)
Signbit-4 17.07n ± ∞ ¹ 17.95n ± ∞ ¹ +5.16% (p=0.008 n=5)
Sin-4 171.6n ± ∞ ¹ 189.1n ± ∞ ¹ +10.20% (p=0.008 n=5)
Sincos-4 201.5n ± ∞ ¹ 200.5n ± ∞ ¹ ~ (p=0.421 n=5)
Sinh-4 529.6n ± ∞ ¹ 484.6n ± ∞ ¹ -8.50% (p=0.008 n=5)
SqrtIndirect-4 5.021n ± ∞ ¹ 5.023n ± ∞ ¹ +0.04% (p=0.048 n=5)
SqrtLatency-4 8.032n ± ∞ ¹ 8.039n ± ∞ ¹ +0.09% (p=0.024 n=5)
SqrtIndirectLatency-4 8.036n ± ∞ ¹ 8.038n ± ∞ ¹ ~ (p=0.056 n=5)
SqrtGoLatency-4 338.8n ± ∞ ¹ 338.7n ± ∞ ¹ ~ (p=0.841 n=5)
SqrtPrime-4 5.379µ ± ∞ ¹ 5.382µ ± ∞ ¹ +0.06% (p=0.048 n=5)
Tan-4 182.7n ± ∞ ¹ 191.8n ± ∞ ¹ +4.98% (p=0.008 n=5)
Tanh-4 558.7n ± ∞ ¹ 497.6n ± ∞ ¹ -10.94% (p=0.008 n=5)
Trunc-4 122.5n ± ∞ ¹ 122.6n ± ∞ ¹ ~ (p=0.405 n=5)
Y0-4 892.8n ± ∞ ¹ 851.7n ± ∞ ¹ -4.60% (p=0.008 n=5)
Y1-4 887.2n ± ∞ ¹ 863.2n ± ∞ ¹ -2.71% (p=0.008 n=5)
Yn-4 1.889µ ± ∞ ¹ 1.832µ ± ∞ ¹ -3.02% (p=0.008 n=5)
Float64bits-4 13.05n ± ∞ ¹ 13.06n ± ∞ ¹ +0.08% (p=0.040 n=5)
Float64frombits-4 13.05n ± ∞ ¹ 13.06n ± ∞ ¹ ~ (p=0.143 n=5)
Float32bits-4 13.05n ± ∞ ¹ 13.06n ± ∞ ¹ +0.08% (p=0.008 n=5)
Float32frombits-4 13.05n ± ∞ ¹ 13.08n ± ∞ ¹ +0.23% (p=0.016 n=5)
FMA-4 445.7n ± ∞ ¹ 448.1n ± ∞ ¹ +0.54% (p=0.008 n=5)
geomean 157.2n 142.8n -9.17%
Change-Id: I9bf104848b588c9ecf79401a81d483d7fcdb0a79
Reviewed-on: https://go-review.googlesource.com/c/go/+/481575
Reviewed-by: M Zhuo <mzh@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Than McIntosh <thanm@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
Reviewed-by: Rong Zhang <rongrong@oss.cipunited.com>
cedf5008a8
cmd/compile: introduce separate memory op combining pass
Memory op combining is currently done using arch-specific rewrite rules. Instead, do them as an arch-independent rewrite pass. This ensures that all architectures (with unaligned loads & stores) get equal treatment. This removes a lot of rewrite rules. The new pass is a bit more comprehensive. It handles things like out-of-order writes and is careful not to apply partial optimizations that then block further optimizations.
Change-Id: I780ff3bb052475cd725a923309616882d25b8d9e
Reviewed-on: https://go-review.googlesource.com/c/go/+/478475
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
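The canonical pattern such a pass targets, as a sketch (byte-wise loads that can be fused into one wide load where unaligned access is allowed):

```go
package main

import "fmt"

// load64 reads a little-endian uint64 one byte at a time. A memory op
// combining pass can fuse the eight loads, shifts and ORs into a single
// 8-byte load on architectures with unaligned loads.
func load64(b []byte) uint64 {
	_ = b[7] // bounds check hint
	return uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
		uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
		uint64(b[6])<<48 | uint64(b[7])<<56
}

func main() {
	fmt.Printf("%#x\n", load64([]byte{1, 2, 3, 4, 5, 6, 7, 8})) // 0x807060504030201
}
```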
96428e160d
cmd/compile: split DIVV/DIVVU op on loong64
Previously, we calculated both the quotient and the remainder together. However, in most cases only one result is needed. By separating these instructions, we can save one instruction in most cases.
Change-Id: I0a2d4167cda68ab606783ba1aa2720ede19d6b53
Reviewed-on: https://go-review.googlesource.com/c/go/+/475315
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
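In Go terms, a sketch of why the split pays off (each statement needs only one of the two results):

```go
package main

import "fmt"

func divmod(a, b int64) (int64, int64) {
	q := a / b // needs only the quotient
	r := a % b // needs only the remainder
	return q, r
}

func main() {
	fmt.Println(divmod(17, 5)) // 3 2
}
```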
42f99b203d
cmd/compile: optimize cmp to cmn under conditions < and >= on arm64
Under the right conditions we can optimize cmp comparisons to cmn
comparisons, such as:
func foo(a, b int) int {
var c int
if a + b < 0 {
c = 1
}
return c
}
Previously it was compiled as:
ADD R1, R0, R1
CMP $0, R1
CSET LT, R0
With this CL it's compiled as:
CMN R1, R0
CSET MI, R0
Here we need to pay attention to overflow of a+b: the MI
flag means N==1, which doesn't honor the overflow flag V; its value
depends only on the sign of the result. So this matches the semantics
of the Go code, and the transformation is correct.
Similarly, this CL also optimizes the case of >= comparison
using the PL conditional flag.
Change-Id: I47179faba5b30cca84ea69bafa2ad5241bf6dfba
Reviewed-on: https://go-review.googlesource.com/c/go/+/476116
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
3360be4a11
cmd/compile: fix extraneous diff in generated files
Looks like CL 475735 contained a not-quite-up-to-date version of the generated file. Maybe ABSFL was in an earlier version of the CL and was removed before check-in without regenerating the generated file? In any case, update the generated file. This shouldn't cause a problem, as that field isn't used in x86/ssa.go.
Change-Id: I3f0b7d41081ba3ce2cdcae385fea16b37d7de81b
Reviewed-on: https://go-review.googlesource.com/c/go/+/477096
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
cedfcba3e8
cmd/compile: intrinsify TrailingZeros{8,32,64} for 386
This CL adds support for intrinsifying the TrailingZeros{8,32,64}
functions on the 386 architecture. We need to handle the case where
the input is 0, which could lead to undefined output from the BSFL
instruction. The next CL will remove the assembly code in the
runtime/internal/sys package.
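A sketch of the zero-input case the intrinsic must guard: the Go results are fixed by the math/bits documentation, while BSFL's output for a zero input is undefined:

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	fmt.Println(bits.TrailingZeros32(0))        // 32, fully defined in Go
	fmt.Println(bits.TrailingZeros32(0b101000)) // 3
	fmt.Println(bits.TrailingZeros8(0))         // 8
}
```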
Change-Id: Ic168edf68e81bf69a536102100fdd3f56f0f4a1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/475735
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
14015be5bb
cmd/compile: optimize multiplication on loong64
Previously, multiplication on the loong64 architecture was performed using MULV and MULHVU instructions to calculate the low 64 bits and high 64 bits of a multiplication respectively. However, in most cases only the low 64 bits are needed. This commit enables computing only the low 64-bit result with the MULV instruction, reducing the binary size slightly.
file before after Δ %
addr2line 2833777 2833849 +72 +0.003%
asm 5267499 5266963 -536 -0.010%
buildid 2579706 2579402 -304 -0.012%
cgo 4798260 4797444 -816 -0.017%
compile 25247419 25175030 -72389 -0.287%
cover 4973091 4972027 -1064 -0.021%
dist 3631013 3565653 -65360 -1.800%
doc 4076036 4074004 -2032 -0.050%
fix 3496378 3496066 -312 -0.009%
link 6984102 6983214 -888 -0.013%
nm 2743820 2743516 -304 -0.011%
objdump 4277171 4277035 -136 -0.003%
pack 2379248 2378872 -376 -0.016%
pprof 14419090 14419874 +784 +0.005%
test2json 2684386 2684018 -368 -0.014%
trace 13640018 13631034 -8984 -0.066%
vet 7748918 7752630 +3712 +0.048%
go 15643850 15638098 -5752 -0.037%
total 127423782 127268729 -155053 -0.122%
Change-Id: Ifce4a9a3ed1d03c170681e39cb6f3541db9882dc
Reviewed-on: https://go-review.googlesource.com/c/go/+/472775
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
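In Go terms, a sketch of the distinction (a plain product needs only the low half, while bits.Mul64 genuinely needs both):

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	a, b := uint64(1<<40), uint64(1<<30)

	lo := a * b            // low 64 bits only: a single MULV suffices after this change
	fmt.Printf("%#x\n", lo) // 0x0 (the product overflows 64 bits)

	hi, lo2 := bits.Mul64(a, b)      // both halves: still MULHVU + MULV
	fmt.Printf("%#x %#x\n", hi, lo2) // 0x40 0x0
}
```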
21d82e6ac8
cmd/compile: batch write barrier calls
Have the write barrier call return a pointer to a buffer into which the generated code records pointers that need write barrier treatment.
Change-Id: I7871764298e0aa1513de417010c8d46b296b199e
Reviewed-on: https://go-review.googlesource.com/c/go/+/447781
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Bypass: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
44d22e75dd
cmd/compile: detect write barrier completion differently
Instead of keeping track of in which blocks write barriers complete, introduce a new op that marks the exact memory state where the write barrier completes. For future use. This allows us to move some of the write barrier code to between the start of the merging block and the WBend marker. Change-Id: If3809b260292667d91bf0ee18d7b4d0eb1e929f0 Reviewed-on: https://go-review.googlesource.com/c/go/+/447777 Reviewed-by: Keith Randall <khr@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Run-TryBot: Keith Randall <khr@golang.org> |
|
|
|
f9da938614 |
cmd/compile: remove unused ISELB PPC64 ssa opcode
The usage of ISELB has been removed as part of changes made to support Power10 SETBC instructions. Change-Id: I2fce4370f48c1eeee65d411dfd1bea4201f45b45 Reviewed-on: https://go-review.googlesource.com/c/go/+/465575 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Paul Murphy <murp@ibm.com> Reviewed-by: Archana Ravindar <aravind5@in.ibm.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> |
|
|
|
a432d89137 |
cmd/compile: add rules to emit SETBC/R instructions on power10
This CL adds rules that replace instances of ISEL that produce a boolean result based on a condition register with SETBC/SETBCR operations. On Power10 these are converted to SETBC/SETBCR instructions that use one register instead of the three registers conventionally used by ISEL, which reduces register pressure. On loops written specifically to exercise such instances of ISEL extensively, a performance improvement of 2.5% is seen on Power10. Also added verification tests to verify correct generation of SETBC/SETBCR instructions on Power10. Change-Id: Ib719897f09d893de40324440a43052dca026e8fa Reviewed-on: https://go-review.googlesource.com/c/go/+/449795 Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Run-TryBot: Archana Ravindar <aravind5@in.ibm.com> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> TryBot-Result: Gopher Robot <gobot@golang.org> |
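The affected source pattern is any boolean materialized directly from a condition. A small illustration (ordinary Go; the function names are just examples, and nothing in the source is Power10-specific):

```go
package p

// Each function returns a boolean computed from a condition register.
// Pre-Power10 this lowers to ISEL selecting between 0 and 1 (three
// registers); on Power10 a single SETBC/SETBCR suffices.
func lessThan(a, b int64) bool { return a < b }

func nonZero(x uint64) bool { return x != 0 }
```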
|
|
|
cd1fc87156 |
cmd/compile: intrinsify math/bits.ReverseBytes{16|32|64} for ppc64/power10
This change intrinsifies ReverseBytes{16|32|64} by generating the
corresponding new instructions in Power10: brh, brd and brw and
adds a verification test for the same.
On Power 9 and 8, the .go code performs optimally as it is.
Performance improvement seen on Power10:
ReverseBytes32 1.38ns ± 0% 1.18ns ± 0% -14.2%
ReverseBytes64 1.52ns ± 0% 1.11ns ± 0% -26.87%
ReverseBytes16 1.41ns ± 1% 1.18ns ± 0% -16.47%
Change-Id: I88f127f3ab9ba24a772becc21ad90acfba324b37
Reviewed-on: https://go-review.googlesource.com/c/go/+/446675
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
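The functions being intrinsified are the standard math/bits byte-reversal helpers; for reference:

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	fmt.Printf("%#x\n", bits.ReverseBytes16(0x1234))             // 0x3412 (brh on Power10)
	fmt.Printf("%#x\n", bits.ReverseBytes32(0x12345678))         // 0x78563412 (brw)
	fmt.Printf("%#x\n", bits.ReverseBytes64(0x0123456789abcdef)) // 0xefcdab8967452301 (brd)
}
```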
|
|
|
|
5c67ebbb31 |
cmd/compile: AMD64v3 remove unnecessary TEST comparison in isPowerOfTwo
With GOAMD64=V3 the canonical isPowerOfTwo function:
func isPowerOfTwo(x uintptr) bool {
return x&(x-1) == 0
}
used to compile to:
temp := BLSR(x) // x&(x-1)
flags = TEST(temp, temp)
return flags.zf
However, the BLSR instruction already sets ZF according to the result,
so we can remove the TEST instruction when we are only checking ZF,
such as in multiple pieces of code around memory allocations.
This makes the code smaller and faster.
Change-Id: Ia12d5a73aa3cb49188c0b647b1eff7b56c5a7b58
Reviewed-on: https://go-review.googlesource.com/c/go/+/448255
Run-TryBot: Jakub Ciolek <jakub@ciolek.dev>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
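Changes like this are typically pinned down with the compiler's asm-matching tests under test/codegen; a hedged sketch of what such a check could look like (the directive syntax is assumed from that suite's conventions, so treat it as illustrative rather than the test actually added):

```go
package codegen

// With GOAMD64=v3, expect BLSRQ with no separate TESTQ.
func isPowerOfTwo(x uint64) bool {
	// amd64/v3:"BLSRQ",-"TESTQ"
	return x&(x-1) == 0
}
```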
|
|
|
|
12befc3ce3 |
cmd/compile: improve scheduling pass
Convert the scheduling pass from scheduling backwards to scheduling forwards. Forward scheduling makes it easier to prioritize scheduling values as soon as they are ready, which is important for things like nil checks, select ops, etc. Forward scheduling is also quite a bit clearer. It was originally backwards because computing uses is tricky, but I found a way to do it simply and with n lg n complexity. The new scheme also makes it easy to add new scheduling edges if needed. Fixes #42673 Update #56568 Change-Id: Ibbb38c52d191f50ce7a94f8c1cbd3cd9b614ea8b Reviewed-on: https://go-review.googlesource.com/c/go/+/270940 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Keith Randall <khr@google.com> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: David Chase <drchase@google.com> |
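A minimal sketch of the forward list-scheduling shape described here, using invented value and priority types (the real pass orders SSA values within a block and tracks more edge kinds; this only illustrates the ready-queue idea of scheduling a value once all its arguments are scheduled, highest priority first):

```go
package main

import (
	"container/heap"
	"fmt"
)

type val struct {
	id       int
	priority int   // higher schedules sooner once ready (e.g. nil checks)
	nargs    int   // unscheduled arguments remaining
	uses     []int // ids of values that consume this one
}

type readyQueue []*val

func (q readyQueue) Len() int           { return len(q) }
func (q readyQueue) Less(i, j int) bool { return q[i].priority > q[j].priority }
func (q readyQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *readyQueue) Push(x any)        { *q = append(*q, x.(*val)) }
func (q *readyQueue) Pop() any {
	old := *q
	v := old[len(old)-1]
	*q = old[:len(old)-1]
	return v
}

func schedule(vals map[int]*val) []int {
	var q readyQueue
	for _, v := range vals {
		if v.nargs == 0 { // ready from the start
			heap.Push(&q, v)
		}
	}
	var order []int
	for q.Len() > 0 {
		v := heap.Pop(&q).(*val)
		order = append(order, v.id)
		for _, u := range v.uses { // a scheduled value may ready its users
			if vals[u].nargs--; vals[u].nargs == 0 {
				heap.Push(&q, vals[u])
			}
		}
	}
	return order
}

func main() {
	// Values 1 and 2 are ready immediately; 3 uses both; 2 has priority.
	vals := map[int]*val{
		1: {id: 1, priority: 0, uses: []int{3}},
		2: {id: 2, priority: 5, uses: []int{3}},
		3: {id: 3, nargs: 2},
	}
	fmt.Println(schedule(vals)) // [2 1 3]
}
```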
|
|
|
45dc81d856 |
cmd/compile: add memory argument to GetCallerSP
We need to make sure that when we get the stack pointer, we get it at the right time.

V = GetCallerSP
Call()
W = GetCallerSP

If Call causes a stack growth, then we will be in a situation where V != W. So it matters when GetCallerSP operations get scheduled. Add a memory argument to GetCallerSP so it can't be reordered with things like calls. Change-Id: I6cc801134c38e358c5a1ec0c09d38379a16a4184 Reviewed-on: https://go-review.googlesource.com/c/go/+/453515 Reviewed-by: Martin Möhrmann <moehrmann@google.com> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Martin Möhrmann <martin@golang.org> Reviewed-by: Robert Griesemer <gri@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> |
|
|
|
f959fb3872 |
cmd/compile: add anchored version of SP
The SPanchored opcode is identical to SP, except that it takes a memory argument so that it (and more importantly, anything that uses it) must be scheduled at or after that memory argument. This opcode ensures that a LEAQ of a variable gets scheduled after the corresponding VARDEF for that variable. This may lead to less CSE of LEAQ operations. The effect is very small. The go binary is only 80 bytes bigger after this CL. Usually LEAQs get folded into load/store operations, so the effect is only for pointerful types, large enough to need a duffzero, and have their address passed somewhere. Even then, usually the CSEd LEAQs will be un-CSEd because the two uses are on different sides of a function call and the LEAQ ends up being rematerialized at the second use anyway. Change-Id: Ib893562cd05369b91dd563b48fb83f5250950293 Reviewed-on: https://go-review.googlesource.com/c/go/+/452916 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com> Reviewed-by: Martin Möhrmann <martin@golang.org> Reviewed-by: Keith Randall <khr@google.com> |
|
|
|
47a0d46716 |
cmd/compile/internal/ssa: generate code via a //go:generate directive
The standard way to generate code in a Go package is via //go:generate directives, which are invoked by the developer explicitly running: go generate import/path/of/said/package Switch to using that approach here. This way, developers don't need to learn and remember a custom way that each particular Go package may choose to implement its code generation. It also enables conveniences such as 'go generate -n' to discover how code is generated without running anything (this works on all packages that rely on //go:generate directives), being able to generate multiple packages at once and from any directory, and so on. Change-Id: I0e5b6a1edeff670a8e588befeef0c445613803c7 Reviewed-on: https://go-review.googlesource.com/c/go/+/460135 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org> Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> |
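For reference, the general shape of the mechanism (the file name, command, and path below are illustrative, not necessarily the exact directive added to this package):

```go
// gen.go (illustrative)
package ssa

// 'go generate cmd/compile/internal/ssa' runs the command below from this
// package's directory; 'go generate -n' prints it without running it.
//go:generate go run ./_gen
```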
|
|
|
5f7abeca5a |
cmd/compile: teach regalloc about temporary registers
Temporary registers are sometimes needed for an architecture backend which needs to use several machine instructions to implement a single SSA instruction. Mark such instructions so that regalloc can reserve the temporary register for it. That way we don't have to reserve a fixed register like we do now. Convert the temp-register-using instructions on amd64 to use this new mechanism. Other archs can follow as needed. Change-Id: I1d0c8588afdad5cd18b4398eb5a0f755be5dead7 Reviewed-on: https://go-review.googlesource.com/c/go/+/398556 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: David Chase <drchase@google.com> |
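A toy illustration of the idea (names invented; the real allocator works on SSA values and register masks): instead of dedicating a fixed scratch register everywhere, pick any register that is not live across the one instruction that needs a temporary.

```go
package main

import "fmt"

// pickTemp returns a register that is free for the duration of a single
// instruction, given the set of registers live across it.
func pickTemp(all []string, liveAcross map[string]bool) (string, bool) {
	for _, r := range all {
		if !liveAcross[r] {
			return r, true
		}
	}
	return "", false // none free: the allocator would have to spill
}

func main() {
	regs := []string{"AX", "BX", "CX", "DX"}
	live := map[string]bool{"AX": true, "CX": true}
	fmt.Println(pickTemp(regs, live)) // BX true
}
```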