mirror of https://github.com/golang/go.git
ff14e08cd3
cmd/compile, math: improve implementation of math.{Max,Min} on loong64
Make math.{Min,Max} intrinsics and implement math.{archMax,archMin}
in hardware.
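A minimal sketch of the special cases any hardware min/max lowering has to preserve (these are the documented math.Min/math.Max semantics; nothing below is loong64-specific):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Any NaN operand must yield NaN, and signed zeros are ordered:
	// Max(+0, -0) = +0, Min(+0, -0) = -0.
	fmt.Println(math.Max(1, math.NaN()))               // NaN
	fmt.Println(math.Min(1, math.NaN()))               // NaN
	fmt.Println(math.Max(math.Inf(-1), 2))             // 2
	fmt.Println(1 / math.Min(0, math.Copysign(0, -1))) // -Inf, because Min picks -0
}
```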
goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
│ old.bench │ new.bench │
│ sec/op │ sec/op vs base │
Max 7.606n ± 0% 3.087n ± 0% -59.41% (p=0.000 n=20)
Min 7.205n ± 0% 2.904n ± 0% -59.69% (p=0.000 n=20)
MinFloat 37.220n ± 0% 4.802n ± 0% -87.10% (p=0.000 n=20)
MaxFloat 33.620n ± 0% 4.802n ± 0% -85.72% (p=0.000 n=20)
geomean 16.18n 3.792n -76.57%
goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3A5000 @ 2500.00MHz
│ old.bench │ new.bench │
│ sec/op │ sec/op vs base │
Max 10.010n ± 0% 7.196n ± 0% -28.11% (p=0.000 n=20)
Min 8.806n ± 0% 7.155n ± 0% -18.75% (p=0.000 n=20)
MinFloat 60.010n ± 0% 7.976n ± 0% -86.71% (p=0.000 n=20)
MaxFloat 56.410n ± 0% 7.980n ± 0% -85.85% (p=0.000 n=20)
geomean 23.37n 7.566n -67.63%
Updates #59120.
Change-Id: I6815d20bc304af3cbf5d6ca8fe0ca1c2ddebea2d
Reviewed-on: https://go-review.googlesource.com/c/go/+/580283
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
c18ff29295
cmd/compile: make sync/atomic AND/OR operations intrinsic on amd64
Update #61395
Change-Id: I59a950f48efc587dfdffce00e2f4f3ab99d8df00
Reviewed-on: https://go-review.googlesource.com/c/go/+/594738
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
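For reference, a sketch of the sync/atomic surface behind #61395 (the typed And/Or methods added in the Go 1.23 cycle; they return the old value, which is what makes the intrinsic's typing subtle):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var flags atomic.Uint32
	flags.Store(0b1010)

	// Or/And return the OLD value; the compiler must distinguish
	// "result used" from "result discarded" (see dbfa3cacc7 below).
	old := flags.Or(0b0001) // old == 0b1010, flags now 0b1011
	fmt.Printf("%04b %04b\n", old, flags.Load())

	old = flags.And(0b0011) // old == 0b1011, flags now 0b0011
	fmt.Printf("%04b %04b\n", old, flags.Load())
}
```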
dbfa3cacc7
cmd/compile: fix typing of atomic logical operations
For atomic AND and OR operations on memory, we currently have two views of the op. One just does the operation on the memory and returns just a memory. The other does the operation on the memory and returns the old value (before having the logical operation done to it) and memory.
These two type differently, and there's currently some confusion in our rules about which is which. Use different names for the two different flavors so we don't get them confused.
Update #61395
Change-Id: I07b4542db672b2cee98169ac42b67db73c482093
Reviewed-on: https://go-review.googlesource.com/c/go/+/594976
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
63c1e141bc
cmd/compile: intrinsify atomic And/Or on arm64
The atomic And/Or operators were added in CL 528797, but the
compiler does not intrinsify them; this CL does so for arm64.
Also, for the existing atomicAnd/Or operations, the updated
value is not used, but previously we needed a register to
temporarily hold it. Now that we have v.RegTmp, the new value
is not needed anymore. This CL changes that.
The other change is that the existing operations don't use their
result, but now we need the old value, not the new value, as
the result.
This CL also aliases all of the And/Or operations into the
sync/atomic package.
Performance on an ARMv8.1 machine:
old.txt new.txt
sec/op sec/op vs base
And32-160 8.716n ± 0% 4.771n ± 1% -45.26% (p=0.000 n=10)
And32Parallel-160 30.58n ± 2% 26.45n ± 4% -13.49% (p=0.000 n=10)
And64-160 8.750n ± 1% 4.754n ± 0% -45.67% (p=0.000 n=10)
And64Parallel-160 29.40n ± 3% 25.55n ± 5% -13.11% (p=0.000 n=10)
Or32-160 8.847n ± 1% 4.754n ± 1% -46.26% (p=0.000 n=10)
Or32Parallel-160 30.75n ± 3% 26.10n ± 4% -15.14% (p=0.000 n=10)
Or64-160 8.825n ± 1% 4.766n ± 0% -46.00% (p=0.000 n=10)
Or64Parallel-160 30.52n ± 5% 25.89n ± 6% -15.17% (p=0.000 n=10)
For #61395
Change-Id: Ib1d1ac83f7f67dcf67f74d003fadb0f80932b826
Reviewed-on: https://go-review.googlesource.com/c/go/+/584715
Auto-Submit: Austin Clements <austin@google.com>
TryBot-Bypass: Austin Clements <austin@google.com>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Fannie Zhang <Fannie.Zhang@arm.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
dca577d882
cmd/compile/internal/ssa: reintroduce ANDconst opcode on PPC64
This allows more effective conversion of rotate and mask opcodes into their CC equivalents, while simplifying the first lowering pass. This was removed before the latelower pass was introduced to fold more cases of compare against zero. Add ANDconst to push the conversion of ANDconst to ANDCCconst into latelower with the other CC opcodes. This also requires introducing RLDICLCC to prevent regressions when ANDconst is converted to RLDICL then to RLDICLCC and back to ANDCCconst when possible.
Change-Id: I9e5f9c99fbefa334db18c6c152c5f967f3ff2590
Reviewed-on: https://go-review.googlesource.com/c/go/+/586160
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
dfb17c126c
cmd/compile: support float min/max instructions on PPC64
This enables efficient use of the builtin min/max function for float64 and float32 types on GOPPC64 >= power9. Extend the assembler to support xsminjdp/xsmaxjdp and use them to implement float min/max. Simplify the VSX xx3 opcode rules to allow FPR arguments, if all arguments are an FPR.
Change-Id: I15882a4ce5dc46eba71d683cf1d184dc4236a328
Reviewed-on: https://go-review.googlesource.com/c/go/+/574535
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Than McIntosh <thanm@google.com>
c7065bb9db
cmd/compile/internal: generate ADDZE on PPC64
This usage shows up in quite a few places, and helps reduce register pressure in several complex crypto functions by removing a MOVD $0,... instruction.
Change-Id: I9444ea8f9d19bfd68fb71ea8dc34e109681b3802
Reviewed-on: https://go-review.googlesource.com/c/go/+/571055
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
997636760e
cmd/compile,cmd/internal/obj: provide rotation pseudo-instructions for riscv64
Provide and use rotation pseudo-instructions for riscv64. The RISC-V bitmanip extension adds support for hardware rotation instructions in the form of ROL, ROLW, ROR, RORI, RORIW and RORW. These are easily implemented in the assembler as pseudo-instructions for CPUs that do not support the bitmanip extension. This approach provides a number of advantages, including reducing the rewrite rules needed in the compiler, simplifying codegen tests and, most importantly, allowing these instructions to be used in assembly (for example, riscv64 optimised versions of SHA-256 and SHA-512). When bitmanip support is added, these instruction sequences can simply be replaced with a single instruction if permitted by the GORISCV64 profile.
Change-Id: Ia23402e1a82f211ac760690deb063386056ae1fa
Reviewed-on: https://go-review.googlesource.com/c/go/+/565015
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: M Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
daa58db486
cmd/compile: improve rotations for riscv64
Enable canRotate for riscv64, enable rotation intrinsics and provide
better rewrite implementations for rotations. By avoiding Lsh*x64
and Rsh*Ux64 we can produce better code, especially for 32 and 64
bit rotations. By enabling canRotate we also benefit from the generic
rotation rewrite rules.
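For reference, a sketch of the Go-level code these rules target (a hypothetical micro-example; both forms go through the rotate lowering once canRotate is enabled):

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	x := uint32(0xDEADBEEF)
	// Compiles to a rotate sequence (or a single rotate instruction
	// once the bitmanip extension is permitted by GORISCV64).
	fmt.Printf("%#x\n", bits.RotateLeft32(x, 8))
	// Hand-written rotate; the generic rewrite rules recognize it too.
	fmt.Printf("%#x\n", x<<8|x>>24)
}
```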
Benchmark on a StarFive VisionFive 2:
│ rotate.1 │ rotate.2 │
│ sec/op │ sec/op vs base │
RotateLeft-4 14.700n ± 0% 8.016n ± 0% -45.47% (p=0.000 n=10)
RotateLeft8-4 14.70n ± 0% 10.69n ± 0% -27.28% (p=0.000 n=10)
RotateLeft16-4 14.70n ± 0% 12.02n ± 0% -18.23% (p=0.000 n=10)
RotateLeft32-4 13.360n ± 0% 8.016n ± 0% -40.00% (p=0.000 n=10)
RotateLeft64-4 13.360n ± 0% 8.016n ± 0% -40.00% (p=0.000 n=10)
geomean 14.15n 9.208n -34.92%
Change-Id: I1a2036fdc57cf88ebb6617eb8d92e1d187e183b2
Reviewed-on: https://go-review.googlesource.com/c/go/+/560315
Reviewed-by: M Zhuo <mengzhuo1203@gmail.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
09ed9a6585
cmd/compile: implement float min/max in hardware for riscv64
CL 514596 adds float min/max for amd64; this CL adds it for riscv64.
The behavior of the RISC-V FMIN/FMAX instructions almost matches Go's
requirements.
However, according to RISC-V spec 8.3 "NaN Generation and Propagation"
>> if at least one input is a signaling NaN, or if both inputs are quiet
>> NaNs, the result is the canonical NaN. If one operand is a quiet NaN
>> and the other is not a NaN, the result is the non-NaN operand.
Go uses quiet NaN as its NaN, and according to the Go spec
>> if any argument is a NaN, the result is a NaN
This requires the float min/max implementation to check whether one
of the operands is a qNaN before the float min/max actually executes.
This CL also fixes a typo in the minmax test.
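A sketch of the required semantics in portable Go (an assumed shape of the check, not the actual SSA lowering):

```go
package main

import (
	"fmt"
	"math"
)

// minFloat64 mirrors what the riscv64 lowering must guarantee: FMIN.D
// alone would return the non-NaN operand for a quiet-NaN input, so a
// NaN check has to run before the hardware min.
func minFloat64(x, y float64) float64 {
	if math.IsNaN(x) || math.IsNaN(y) {
		return math.NaN() // Go spec: any NaN argument yields NaN
	}
	return math.Min(x, y) // stands in for the FMIN.D instruction
}

func main() {
	fmt.Println(minFloat64(1, math.NaN())) // NaN, not 1
	fmt.Println(minFloat64(1, 2))          // 1
}
```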
Benchmark on Visionfive2
goos: linux
goarch: riscv64
pkg: runtime
│ float_minmax.old.bench │ float_minmax.new.bench │
│ sec/op │ sec/op vs base │
MinFloat 158.20n ± 0% 28.13n ± 0% -82.22% (p=0.000 n=10)
MaxFloat 158.10n ± 0% 28.12n ± 0% -82.21% (p=0.000 n=10)
geomean 158.1n 28.12n -82.22%
Update #59488
Change-Id: Iab48be6d32b8882044fb8c821438ca8840e5493d
Reviewed-on: https://go-review.googlesource.com/c/go/+/514775
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Run-TryBot: M Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
6b77d1b736
cmd/compile: update loong64 CALL* ops
Allow the loong64 CALL* ops to take a variable number of args.
Update #40724
Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: I4706d9651fcbf9a0f201af6820c97b1a924f14e3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521781
Auto-Submit: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
ebca52eeb7
cmd/compile/internal: add register info for loong64 regABI
Update #40724
Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: Ifd7d94147b01e4fc83978b53dca2bcc0ad1ac4e3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521779
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
070139a130
cmd/compile,cmd/internal,runtime: change registers on loong64 to avoid regABI arguments
Update #40724
Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: Ic7e2e7fb4c1d3670e6abbfb817aa6e4e654e08d3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521777
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: David Chase <drchase@google.com>
f43581131e
cmd/compile, cmd/internal, runtime: change the registers used by the duff device for loong64
Add R21 to the allocatable registers, and use R20 and R21 in the duff device. This CL is in preparation for subsequent regABI support.
Updates #40724
Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: If1661adc0f766925fbe74827a369797f95fa28a9
Reviewed-on: https://go-review.googlesource.com/c/go/+/521775
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
773039ed5c
cmd/compile/internal/ssa: on PPC64, merge (CMPconst [0] (op ...)) more aggressively
Generate the CC version of many opcodes whose result is compared against signed 0. The approach taken here works even if the opcode result is used in multiple places too. Add support for ADD, ADDconst, ANDN, SUB, NEG, CNTLZD, NOR conversions to their CC opcode variant. These are the most commonly used variants. Also, do not set clobberFlags of CNTLZD and CNTLZW, they do not clobber flags. This results in about 1% smaller text sections in kubernetes binaries, and no regressions in the crypto benchmarks.
Change-Id: I9e0381944869c3774106bf348dead5ecb96dffda
Reviewed-on: https://go-review.googlesource.com/c/go/+/538636
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Jayanth Krishnamurthy <jayanth.krishnamurthy@ibm.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
962ccbef91
cmd/compile: ensure pointer arithmetic happens after the nil check
Have nil checks return a pointer that is known non-nil. Users of that pointer can use the result, ensuring that they are ordered after the nil check itself. The order dependence goes away after scheduling, when we've fixed an order. At that point we move uses back to the original pointer, so it doesn't change register allocation at all. This prevents pointer arithmetic on nil from being spilled to the stack and then observed by a stack scan.
Fixes #63657
Change-Id: I1a5fa4f2e6d9000d672792b4f90dfc1b7b67f6ea
Reviewed-on: https://go-review.googlesource.com/c/go/+/537775
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
8fc043ccfa
cmd/compile: optimize right shifts of int32 on riscv64
The compiler is currently sign extending 32 bit signed integers to
64 bits before right shifting them using a 64 bit shift instruction.
There's no need to do this as RISC-V has instructions for right
shifting 32 bit signed values (sraw and sraiw) which sign extend
the result of the shift to 64 bits. Change the compiler so that
it uses sraw and sraiw for shifts of signed 32 bit integers,
reducing in most cases the number of instructions needed to
perform the shift.
Here are some examples of code sequences that are changed by this
patch:
int32(a) >> 2
before:
sll x5,x10,0x20
sra x10,x5,0x22
after:
sraw x10,x10,0x2
int32(v) >> int(s)
before:
sext.w x5,x10
sltiu x6,x11,64
add x6,x6,-1
or x6,x11,x6
sra x10,x5,x6
after:
sltiu x5,x11,32
add x5,x5,-1
or x5,x11,x5
sraw x10,x10,x5
int32(v) >> (int(s) & 31)
before:
sext.w x5,x10
and x6,x11,63
sra x10,x5,x6
after:
and x5,x11,31
sraw x10,x10,x5
int32(100) >> int(a)
before:
bltz x10,<target address calls runtime.panicshift>
sltiu x5,x10,64
add x5,x5,-1
or x5,x10,x5
li x6,100
sra x10,x6,x5
after:
bltz x10,<target address calls runtime.panicshift>
sltiu x5,x10,32
add x5,x5,-1
or x5,x10,x5
li x6,100
sraw x10,x6,x5
int32(v) >> (int(s) & 63)
before:
sext.w x5,x10
and x6,x11,63
sra x10,x5,x6
after:
and x5,x11,63
sltiu x6,x5,32
add x6,x6,-1
or x5,x5,x6
sraw x10,x10,x5
In most cases we eliminate one instruction. In the case where
we shift an int32 constant by a variable, the number of instructions
generated is identical: a sra is simply replaced by a sraw. In the
unusual case where we shift right by a variable ANDed with a constant
> 31 but < 64, we generate two additional instructions. As this is
an unusual case, we do not try to optimize for it.
Some improvements can be seen in some of the existing benchmarks,
notably in the utf8 package, which performs right shifts of runes,
which are signed 32 bit integers.
| utf8-old | utf8-new |
| sec/op | sec/op vs base |
EncodeASCIIRune-4 17.68n ± 0% 17.67n ± 0% ~ (p=0.312 n=10)
EncodeJapaneseRune-4 35.34n ± 0% 34.53n ± 1% -2.31% (p=0.000 n=10)
AppendASCIIRune-4 3.213n ± 0% 3.213n ± 0% ~ (p=0.318 n=10)
AppendJapaneseRune-4 36.14n ± 0% 35.35n ± 0% -2.19% (p=0.000 n=10)
DecodeASCIIRune-4 28.11n ± 0% 27.36n ± 0% -2.69% (p=0.000 n=10)
DecodeJapaneseRune-4 38.55n ± 0% 38.58n ± 0% ~ (p=0.612 n=10)
Change-Id: I60a91cbede9ce65597571c7b7dd9943eeb8d3cc2
Reviewed-on: https://go-review.googlesource.com/c/go/+/535115
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: M Zhuo <mzh@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
3754ca0af2
cmd/compile: improve the implementation of Lowered{Move,Zero} on linux/loong64
As in CL 487295: when implementing Lowered{Move,Zero}, 8 is first subtracted
from Rarg0 (parameter Ptr), and then an offset of 8 is added during subsequent
operations on Rarg0. This operation is meaningless, so delete it.
Also change LoweredMove's Rarg0 register to R20, consistent with duffcopy.
goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3C5000 @ 2200.00MHz
│ old.bench │ new.bench │
│ sec/op │ sec/op vs base │
Memmove/15 19.10n ± 0% 19.10n ± 0% ~ (p=0.483 n=15)
MemmoveUnalignedDst/15 25.02n ± 0% 25.02n ± 0% ~ (p=0.741 n=15)
MemmoveUnalignedDst/32 48.22n ± 0% 48.22n ± 0% ~ (p=1.000 n=15) ¹
MemmoveUnalignedDst/64 90.57n ± 0% 90.52n ± 0% ~ (p=0.212 n=15)
MemmoveUnalignedDstOverlap/32 44.12n ± 0% 44.13n ± 0% +0.02% (p=0.000 n=15)
MemmoveUnalignedDstOverlap/64 87.79n ± 0% 87.80n ± 0% +0.01% (p=0.002 n=15)
MemmoveUnalignedSrc/0 3.639n ± 0% 3.639n ± 0% ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/1 7.733n ± 0% 7.733n ± 0% ~ (p=1.000 n=15)
MemmoveUnalignedSrc/2 9.097n ± 0% 9.097n ± 0% ~ (p=1.000 n=15)
MemmoveUnalignedSrc/3 10.46n ± 0% 10.46n ± 0% ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/4 11.83n ± 0% 11.83n ± 0% ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/64 93.71n ± 0% 93.70n ± 0% ~ (p=0.128 n=15)
Memclr/4096 699.1n ± 0% 699.1n ± 0% ~ (p=0.682 n=15)
Memclr/65536 11.18µ ± 0% 11.18µ ± 0% -0.01% (p=0.000 n=15)
Memclr/1M 175.2µ ± 0% 175.2µ ± 0% ~ (p=0.191 n=15)
Memclr/4M 661.8µ ± 0% 662.0µ ± 0% ~ (p=0.486 n=15)
MemclrUnaligned/4_5 19.39n ± 0% 20.47n ± 0% +5.57% (p=0.000 n=15)
MemclrUnaligned/4_16 22.29n ± 0% 21.38n ± 0% -4.08% (p=0.000 n=15)
MemclrUnaligned/4_64 30.58n ± 0% 29.81n ± 0% -2.52% (p=0.000 n=15)
MemclrUnaligned/4_65536 11.19µ ± 0% 11.20µ ± 0% +0.02% (p=0.000 n=15)
GoMemclr/5 12.73n ± 0% 12.73n ± 0% ~ (p=0.261 n=15)
GoMemclr/16 10.01n ± 0% 10.00n ± 0% ~ (p=0.264 n=15)
GoMemclr/256 50.94n ± 0% 50.94n ± 0% ~ (p=0.372 n=15)
ClearFat15 14.95n ± 0% 15.01n ± 4% ~ (p=0.925 n=15)
ClearFat1032 125.5n ± 0% 125.6n ± 0% +0.08% (p=0.000 n=15)
CopyFat64 10.58n ± 0% 10.01n ± 0% -5.39% (p=0.000 n=15)
CopyFat1040 244.3n ± 0% 155.6n ± 0% -36.31% (p=0.000 n=15)
Issue18740/2byte 29.82µ ± 0% 29.82µ ± 0% ~ (p=0.648 n=30)
Issue18740/4byte 18.18µ ± 0% 18.18µ ± 0% -0.02% (p=0.001 n=30)
Issue18740/8byte 8.395µ ± 0% 8.395µ ± 0% ~ (p=0.401 n=30)
geomean 154.5n 151.8n -1.70%
¹ all samples are equal
Change-Id: Ia3f3c8b25e1e93c97ab72328651de78ca9dec016
Reviewed-on: https://go-review.googlesource.com/c/go/+/488515
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
f711892a8a
cmd/compile/internal: stop lowering OpConvert on riscv64
Lowering for OpConvert was removed for all architectures in CL 108496, prior to the riscv64 port being upstreamed. Remove lowering of OpConvert on riscv64, which brings it in line with all other architectures. This results in 1,600+ instructions being removed from the riscv64 go binary.
Change-Id: Iaaf1f8b397875926604048b66ad8ac91a98c871e
Reviewed-on: https://go-review.googlesource.com/c/go/+/533335
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
561bf0457f
cmd/compile: optimize right shifts of uint32 on riscv
The compiler is currently zero extending 32 bit unsigned integers to
64 bits before right shifting them using a 64 bit shift instruction.
There's no need to do this as RISC-V has instructions for right
shifting 32 bit unsigned values (srlw and srliw) which zero extend
the result of the shift to 64 bits. Change the compiler so that
it uses srlw and srliw for 32 bit unsigned shifts, reducing in most
cases the number of instructions needed to perform the shift.
Here are some examples of code sequences that are changed by this
patch:
uint32(a) >> 2
before:
sll x5,x10,0x20
srl x10,x5,0x22
after:
srlw x10,x10,0x2
uint32(a) >> int(b)
before:
sll x5,x10,0x20
srl x5,x5,0x20
srl x5,x5,x11
sltiu x6,x11,64
neg x6,x6
and x10,x5,x6
after:
srlw x5,x10,x11
sltiu x6,x11,32
neg x6,x6
and x10,x5,x6
bits.RotateLeft32(uint32(a), 1)
before:
sll x5,x10,0x1
sll x6,x10,0x20
srl x7,x6,0x3f
or x5,x5,x7
after:
sll x5,x10,0x1
srlw x6,x10,0x1f
or x10,x5,x6
bits.RotateLeft32(uint32(a), int(b))
before:
and x6,x11,31
sll x7,x10,x6
sll x8,x10,0x20
srl x8,x8,0x20
add x6,x6,-32
neg x6,x6
srl x9,x8,x6
sltiu x6,x6,64
neg x6,x6
and x6,x9,x6
or x6,x6,x7
after:
and x5,x11,31
sll x6,x10,x5
add x5,x5,-32
neg x5,x5
srlw x7,x10,x5
sltiu x5,x5,32
neg x5,x5
and x5,x7,x5
or x10,x6,x5
The one regression observed is the following case: an unbounded right
shift of a uint32 where the value we're shifting by is known to be
< 64 but > 31. As this is an unusual case, this commit does not
optimize for it, although the existing code does.
uint32(a) >> (b & 63)
before:
sll x5,x10,0x20
srl x5,x5,0x20
and x6,x11,63
srl x10,x5,x6
after:
and x5,x11,63
srlw x6,x10,x5
sltiu x5,x5,32
neg x5,x5
and x10,x6,x5
Here we have one extra instruction.
Some benchmark highlights, generated on a VisionFive2 8GB running
Ubuntu 23.04.
pkg: math/bits
LeadingZeros32-4 18.64n ± 0% 17.32n ± 0% -7.11% (p=0.000 n=10)
LeadingZeros64-4 15.47n ± 0% 15.51n ± 0% +0.26% (p=0.027 n=10)
TrailingZeros16-4 18.48n ± 0% 17.68n ± 0% -4.33% (p=0.000 n=10)
TrailingZeros32-4 16.87n ± 0% 16.07n ± 0% -4.74% (p=0.000 n=10)
TrailingZeros64-4 15.26n ± 0% 15.27n ± 0% +0.07% (p=0.043 n=10)
OnesCount32-4 20.08n ± 0% 19.29n ± 0% -3.96% (p=0.000 n=10)
RotateLeft-4 8.864n ± 0% 8.838n ± 0% -0.30% (p=0.006 n=10)
RotateLeft32-4 8.837n ± 0% 8.032n ± 0% -9.11% (p=0.000 n=10)
Reverse32-4 29.77n ± 0% 26.52n ± 0% -10.93% (p=0.000 n=10)
ReverseBytes32-4 9.640n ± 0% 8.838n ± 0% -8.32% (p=0.000 n=10)
Sub32-4 8.835n ± 0% 8.035n ± 0% -9.06% (p=0.000 n=10)
geomean 11.50n 11.33n -1.45%
pkg: crypto/md5
Hash8Bytes-4 1.486µ ± 0% 1.426µ ± 0% -4.04% (p=0.000 n=10)
Hash64-4 2.079µ ± 0% 1.968µ ± 0% -5.36% (p=0.000 n=10)
Hash128-4 2.720µ ± 0% 2.557µ ± 0% -5.99% (p=0.000 n=10)
Hash256-4 3.996µ ± 0% 3.733µ ± 0% -6.58% (p=0.000 n=10)
Hash512-4 6.541µ ± 0% 6.072µ ± 0% -7.18% (p=0.000 n=10)
Hash1K-4 11.64µ ± 0% 10.75µ ± 0% -7.58% (p=0.000 n=10)
Hash8K-4 82.95µ ± 0% 76.32µ ± 0% -7.99% (p=0.000 n=10)
Hash1M-4 10.436m ± 0% 9.591m ± 0% -8.10% (p=0.000 n=10)
Hash8M-4 83.50m ± 0% 76.73m ± 0% -8.10% (p=0.000 n=10)
Hash8BytesUnaligned-4 1.494µ ± 0% 1.434µ ± 0% -4.02% (p=0.000 n=10)
Hash1KUnaligned-4 11.64µ ± 0% 10.76µ ± 0% -7.52% (p=0.000 n=10)
Hash8KUnaligned-4 83.01µ ± 0% 76.32µ ± 0% -8.07% (p=0.000 n=10)
geomean 28.32µ 26.42µ -6.72%
Change-Id: I20483a6668cca1b53fe83944bee3706aadcf8693
Reviewed-on: https://go-review.googlesource.com/c/go/+/528975
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
d98f74b31e
cmd/compile/internal: intrinsify publicationBarrier on riscv64
This enables publicationBarrier to be used as an intrinsic
on riscv64, eliminating the function call and return
instructions otherwise required to invoke the
"runtime.publicationBarrier" function.
This function is called by mallocgc. The benchmark results for malloc, tested on a Lichee Pi 4A (TH1520, RISC-V C910 x4 @ 2.0GHz), are as follows.
goos: linux
goarch: riscv64
pkg: runtime
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
Malloc8-4 92.78n ± 1% 90.77n ± 1% -2.17% (p=0.001 n=10)
Malloc16-4 156.5n ± 1% 151.7n ± 2% -3.10% (p=0.000 n=10)
MallocTypeInfo8-4 131.7n ± 1% 130.6n ± 2% ~ (p=0.165 n=10)
MallocTypeInfo16-4 186.5n ± 2% 186.2n ± 1% ~ (p=0.956 n=10)
MallocLargeStruct-4 1.345µ ± 1% 1.355µ ± 1% ~ (p=0.093 n=10)
geomean 216.9n 214.5n -1.10%
Change-Id: Ieab6c02309614bac5c1b12b5ee3311f988ff644d
Reviewed-on: https://go-review.googlesource.com/c/go/+/531719
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: M Zhuo <mzh@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
06f420fc19
runtime: remove the meaningless offset of 8 for duffzero on loong64
Currently we subtract 8 from the offset when calling duffzero because 8 is added to the offset in the duffzero implementation. This operation is meaningless, so remove it.
Change-Id: I7e451d04d7e98ccafe711645d81d3aadf376766f
Reviewed-on: https://go-review.googlesource.com/c/go/+/487295
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Run-TryBot: WANG Xuerui <git@xen0n.name>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
63ab68ddc5
cmd/compile: add single-precision FMA code generation for riscv64
This CL adds FMADDS, FMSUBS, FNMADDS and FNMSUBS SSA support for riscv64.
Change-Id: I1e7dd322b46b9e0f4923dbba256303d69ed12066
Reviewed-on: https://go-review.googlesource.com/c/go/+/506616
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: M Zhuo <mzh@golangcn.org>
05f9511582
cmd/compile: improve FP FMA performance on riscv64
FMADD/FMSUB/FNMSUB are efficient FP FMA instructions, which can be used by the compiler to improve FP performance.
Erf 188.0n ± 2% 139.5n ± 2% -25.82% (p=0.000 n=10)
Erfc 193.6n ± 1% 143.2n ± 1% -26.01% (p=0.000 n=10)
Erfinv 244.4n ± 2% 172.6n ± 0% -29.40% (p=0.000 n=10)
Erfcinv 244.7n ± 2% 173.0n ± 1% -29.31% (p=0.000 n=10)
geomean 216.0n 156.3n -27.65%
Ref: The RISC-V Instruction Set Manual Volume I: Unprivileged ISA, 11.6 Single-Precision Floating-Point Computational Instructions
Change-Id: I89aa3a4df7576fdd47f4a6ee608ac16feafd093c
Reviewed-on: https://go-review.googlesource.com/c/go/+/506036
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: M Zhuo <mzh@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
41c71d48a1
cmd/compile/internal: add RLDICR opcode for PPC64
This is encoded similarly to RLDICL, but can clear the least significant bits. Likewise, update the auxint encoding of RLDICL to match those used by the rotate and mask word ssa opcodes for easier usage within lowering rules. The RLDICL ssa opcode is not used yet.
Change-Id: I42486dd95714a3e8e2f19ab237a6cf3af520c905
Reviewed-on: https://go-review.googlesource.com/c/go/+/515575
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
611706b171
cmd/compile: don't use BTS when OR works, add direct memory BTS operations
Stop using BTSconst and friends when ORLconst can be used instead. OR can be issued by more function units than BTS can, so it could lead to better IPC. OR might take a few more bytes to encode, but not a lot more. Still use BTSconst for cases where the constant otherwise wouldn't fit and would require a separate movabs instruction to materialize the constant. This happens when setting bits 31-63 of 64-bit targets. Add BTS-to-memory operations so we don't need to load/bts/store.
Fixes #61694
Change-Id: I00379608df8fb0167cb01466e97d11dec7c1596c
Reviewed-on: https://go-review.googlesource.com/c/go/+/515755
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
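Roughly the distinction being drawn, as a sketch (the bit positions are hypothetical, chosen to straddle the 32-bit immediate range):

```go
package main

import "fmt"

func setBits(x uint64) uint64 {
	x |= 1 << 20 // fits a 32-bit immediate: a plain OR is preferred
	x |= 1 << 40 // bit 40: the OR constant would need a movabs, so BTSQ $40 wins
	return x
}

func main() {
	fmt.Printf("%#x\n", setBits(0)) // 0x10000100000
}
```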
319504ce43
cmd/compile: implement float min/max in hardware for amd64 and arm64
Update #59488
Change-Id: I89f5ea494cbcc887f6fae8560e57bcbd8749be86
Reviewed-on: https://go-review.googlesource.com/c/go/+/514596
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
67983c0f78
cmd/compile: add indexed SET* opcodes for amd64
Update #61356
Change-Id: I391af98563b1c068208784c80ea736c78c29639d
Reviewed-on: https://go-review.googlesource.com/c/go/+/510435
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Martin Möhrmann <martin@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
d9fd19a7f5
cmd/compile: optimize math.Float32bits and math.Float32frombits on mipsx
This CL uses MFC1/MTC1 instructions to move data between GPRs and FPRs, instead of using stores and loads to move float/int values.
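For reference, the functions being sped up are the math package's bit-cast helpers; a sketch of their use (each call is conceptually a single register-to-register move):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// With MTC1/MFC1 these no longer round-trip through the stack on mipsx.
	b := math.Float32bits(1.5)     // FPR -> GPR
	f := math.Float32frombits(b)   // GPR -> FPR
	fmt.Printf("%#08x %g\n", b, f) // 0x3fc00000 1.5
}
```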
goos: linux
goarch: mipsle
pkg: math
│ oldmathf │ newmathf │
│ sec/op │ sec/op vs base │
Acos-4 282.7n ± 0% 282.1n ± 0% -0.18% (p=0.010 n=8)
Acosh-4 450.8n ± 0% 450.9n ± 0% ~ (p=0.699 n=8)
Asin-4 272.6n ± 0% 272.1n ± 0% ~ (p=0.050 n=8)
Asinh-4 476.8n ± 0% 475.1n ± 0% -0.35% (p=0.018 n=8)
Atan-4 208.1n ± 0% 207.7n ± 0% -0.17% (p=0.009 n=8)
Atanh-4 448.8n ± 0% 448.7n ± 0% -0.03% (p=0.014 n=8)
Atan2-4 310.2n ± 0% 310.1n ± 0% ~ (p=0.133 n=8)
Cbrt-4 357.9n ± 0% 358.4n ± 0% +0.11% (p=0.014 n=8)
Ceil-4 203.8n ± 0% 204.7n ± 0% +0.42% (p=0.008 n=8)
Compare-4 21.12n ± 0% 22.09n ± 0% +4.59% (p=0.000 n=8)
Compare32-4 19.105n ± 0% 6.022n ± 0% -68.48% (p=0.000 n=8)
Copysign-4 33.17n ± 0% 33.15n ± 0% ~ (p=0.795 n=8)
Cos-4 385.2n ± 0% 384.8n ± 1% ~ (p=0.112 n=8)
Cosh-4 546.0n ± 0% 545.0n ± 0% -0.17% (p=0.012 n=8)
Erf-4 192.4n ± 0% 195.4n ± 1% +1.59% (p=0.000 n=8)
Erfc-4 187.8n ± 0% 192.7n ± 0% +2.64% (p=0.000 n=8)
Erfinv-4 221.8n ± 1% 219.8n ± 0% -0.88% (p=0.000 n=8)
Erfcinv-4 224.1n ± 1% 219.9n ± 0% -1.87% (p=0.000 n=8)
Exp-4 434.7n ± 0% 435.0n ± 0% ~ (p=0.339 n=8)
ExpGo-4 433.7n ± 0% 434.2n ± 0% +0.13% (p=0.005 n=8)
Expm1-4 243.0n ± 0% 242.9n ± 0% ~ (p=0.103 n=8)
Exp2-4 426.6n ± 0% 426.6n ± 0% ~ (p=0.822 n=8)
Exp2Go-4 425.6n ± 0% 425.5n ± 0% ~ (p=0.377 n=8)
Abs-4 8.033n ± 0% 8.029n ± 0% ~ (p=0.065 n=8)
Dim-4 18.07n ± 0% 18.07n ± 0% ~ (p=0.051 n=8)
Floor-4 151.6n ± 0% 151.6n ± 0% ~ (p=0.450 n=8)
Max-4 100.9n ± 8% 103.2n ± 2% ~ (p=0.099 n=8)
Min-4 116.4n ± 0% 116.4n ± 0% ~ (p=0.467 n=8)
Mod-4 959.6n ± 1% 950.9n ± 0% -0.91% (p=0.006 n=8)
Frexp-4 147.6n ± 0% 147.5n ± 0% -0.07% (p=0.026 n=8)
Gamma-4 482.7n ± 0% 478.2n ± 2% -0.92% (p=0.000 n=8)
Hypot-4 139.8n ± 1% 127.1n ± 8% -9.12% (p=0.000 n=8)
HypotGo-4 137.2n ± 7% 117.5n ± 2% -14.39% (p=0.001 n=8)
Ilogb-4 109.5n ± 0% 108.4n ± 1% -1.05% (p=0.001 n=8)
J0-4 1.304µ ± 0% 1.304µ ± 0% ~ (p=0.853 n=8)
J1-4 1.349µ ± 0% 1.331µ ± 0% -1.33% (p=0.000 n=8)
Jn-4 2.774µ ± 0% 2.750µ ± 0% -0.87% (p=0.000 n=8)
Ldexp-4 151.6n ± 0% 151.5n ± 0% ~ (p=0.695 n=8)
Lgamma-4 226.9n ± 0% 233.9n ± 0% +3.09% (p=0.000 n=8)
Log-4 407.6n ± 0% 407.4n ± 0% ~ (p=0.340 n=8)
Logb-4 121.5n ± 0% 121.5n ± 0% -0.08% (p=0.042 n=8)
Log1p-4 315.5n ± 0% 315.6n ± 0% ~ (p=0.930 n=8)
Log10-4 417.8n ± 0% 417.5n ± 0% ~ (p=0.053 n=8)
Log2-4 208.8n ± 0% 208.8n ± 0% ~ (p=0.582 n=8)
Modf-4 126.5n ± 0% 126.4n ± 0% ~ (p=0.128 n=8)
Nextafter32-4 112.45n ± 0% 82.27n ± 0% -26.84% (p=0.000 n=8)
Nextafter64-4 141.5n ± 0% 141.5n ± 0% ~ (p=0.569 n=8)
PowInt-4 754.0n ± 1% 754.6n ± 0% ~ (p=0.279 n=8)
PowFrac-4 1.608µ ± 1% 1.596µ ± 1% ~ (p=0.661 n=8)
Pow10Pos-4 18.07n ± 0% 18.07n ± 0% ~ (p=0.413 n=8)
Pow10Neg-4 17.08n ± 0% 18.07n ± 0% +5.80% (p=0.000 n=8)
Round-4 68.30n ± 0% 69.29n ± 0% +1.45% (p=0.000 n=8)
RoundToEven-4 78.33n ± 0% 78.34n ± 0% ~ (p=0.975 n=8)
Remainder-4 740.6n ± 1% 736.7n ± 0% ~ (p=0.098 n=8)
Signbit-4 18.08n ± 0% 18.07n ± 0% ~ (p=0.546 n=8)
Sin-4 389.4n ± 0% 389.5n ± 0% ~ (p=0.451 n=8)
Sincos-4 415.6n ± 0% 415.6n ± 0% ~ (p=0.450 n=8)
Sinh-4 607.0n ± 0% 590.8n ± 1% -2.68% (p=0.000 n=8)
SqrtIndirect-4 8.034n ± 0% 8.030n ± 0% ~ (p=0.487 n=8)
SqrtLatency-4 8.031n ± 0% 8.034n ± 0% ~ (p=0.152 n=8)
SqrtIndirectLatency-4 8.032n ± 0% 8.032n ± 0% ~ (p=0.818 n=8)
SqrtGoLatency-4 895.8n ± 0% 895.3n ± 0% ~ (p=0.553 n=8)
SqrtPrime-4 5.405µ ± 0% 5.379µ ± 0% -0.48% (p=0.000 n=8)
Tan-4 405.6n ± 0% 405.7n ± 0% ~ (p=0.980 n=8)
Tanh-4 545.1n ± 0% 545.1n ± 0% ~ (p=0.806 n=8)
Trunc-4 146.5n ± 0% 146.6n ± 0% ~ (p=0.380 n=8)
Y0-4 1.308µ ± 0% 1.306µ ± 0% ~ (p=0.071 n=8)
Y1-4 1.311µ ± 0% 1.315µ ± 0% +0.31% (p=0.000 n=8)
Yn-4 2.737µ ± 0% 2.745µ ± 0% +0.27% (p=0.000 n=8)
Float64bits-4 14.56n ± 0% 14.56n ± 0% ~ (p=0.689 n=8)
Float64frombits-4 19.08n ± 0% 19.08n ± 0% ~ (p=0.580 n=8)
Float32bits-4 13.050n ± 0% 5.019n ± 0% -61.54% (p=0.000 n=8)
Float32frombits-4 13.060n ± 0% 4.016n ± 0% -69.25% (p=0.000 n=8)
FMA-4 608.5n ± 0% 586.1n ± 0% -3.67% (p=0.000 n=8)
geomean 185.5n 176.2n -5.02%
Change-Id: Ibf91092ffe70104e6c5ec03bc76d51259818b9b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/494535
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
f0d575c266
cmd/compile: optimize math.Float64(32)bits and math.Float64(32)frombits on mips64x
This CL uses MFC1/MTC1 instructions to move data between GPRs and FPRs, instead of using stores and loads to move float/int values.
goos: linux
goarch: mips64le
pkg: math
│ oldmath │ newmath │
│ sec/op │ sec/op vs base │
Acos-4 258.2n ± 0% 258.2n ± 0% ~ (p=0.859 n=8)
Acosh-4 378.7n ± 0% 323.9n ± 0% -14.47% (p=0.000 n=8)
Asin-4 255.1n ± 2% 255.5n ± 0% +0.16% (p=0.002 n=8)
Asinh-4 407.1n ± 0% 348.7n ± 0% -14.35% (p=0.000 n=8)
Atan-4 189.5n ± 0% 189.9n ± 3% ~ (p=0.205 n=8)
Atanh-4 355.6n ± 0% 323.4n ± 2% -9.03% (p=0.000 n=8)
Atan2-4 284.1n ± 7% 280.1n ± 4% ~ (p=0.313 n=8)
Cbrt-4 314.3n ± 0% 236.4n ± 0% -24.79% (p=0.000 n=8)
Ceil-4 144.3n ± 3% 139.6n ± 0% ~ (p=0.069 n=8)
Compare-4 21.100n ± 0% 7.035n ± 0% -66.66% (p=0.000 n=8)
Compare32-4 20.100n ± 0% 6.030n ± 0% -70.00% (p=0.000 n=8)
Copysign-4 34.970n ± 0% 6.221n ± 0% -82.21% (p=0.000 n=8)
Cos-4 183.4n ± 3% 184.1n ± 5% ~ (p=0.159 n=8)
Cosh-4 487.9n ± 2% 419.6n ± 0% -14.00% (p=0.000 n=8)
Erf-4 160.6n ± 0% 157.9n ± 0% -1.68% (p=0.009 n=8)
Erfc-4 183.7n ± 4% 169.8n ± 0% -7.54% (p=0.000 n=8)
Erfinv-4 191.5n ± 4% 183.6n ± 0% -4.13% (p=0.023 n=8)
Erfcinv-4 192.0n ± 7% 184.3n ± 0% ~ (p=0.425 n=8)
Exp-4 398.2n ± 0% 340.1n ± 4% -14.58% (p=0.000 n=8)
ExpGo-4 383.3n ± 0% 327.3n ± 0% -14.62% (p=0.000 n=8)
Expm1-4 248.7n ± 5% 216.0n ± 0% -13.11% (p=0.000 n=8)
Exp2-4 372.8n ± 0% 316.9n ± 3% -14.98% (p=0.000 n=8)
Exp2Go-4 374.1n ± 0% 320.5n ± 0% -14.33% (p=0.000 n=8)
Abs-4 3.013n ± 0% 3.016n ± 0% +0.10% (p=0.020 n=8)
Dim-4 5.021n ± 0% 5.022n ± 0% ~ (p=0.270 n=8)
Floor-4 127.5n ± 4% 126.2n ± 3% ~ (p=0.186 n=8)
Max-4 72.32n ± 0% 61.33n ± 0% -15.20% (p=0.000 n=8)
Min-4 83.33n ± 1% 61.36n ± 0% -26.37% (p=0.000 n=8)
Mod-4 690.7n ± 0% 454.5n ± 0% -34.20% (p=0.000 n=8)
Frexp-4 116.30n ± 1% 71.80n ± 1% -38.26% (p=0.000 n=8)
Gamma-4 389.0n ± 0% 355.9n ± 1% -8.48% (p=0.000 n=8)
Hypot-4 102.40n ± 0% 83.90n ± 0% -18.07% (p=0.000 n=8)
HypotGo-4 105.45n ± 4% 84.82n ± 2% -19.56% (p=0.000 n=8)
Ilogb-4 99.13n ± 4% 63.71n ± 2% -35.73% (p=0.000 n=8)
J0-4 859.7n ± 0% 854.8n ± 0% -0.57% (p=0.000 n=8)
J1-4 873.9n ± 0% 875.7n ± 0% +0.21% (p=0.007 n=8)
Jn-4 1.855µ ± 0% 1.867µ ± 0% +0.65% (p=0.000 n=8)
Ldexp-4 130.50n ± 2% 64.35n ± 0% -50.69% (p=0.000 n=8)
Lgamma-4 208.8n ± 0% 200.9n ± 0% -3.78% (p=0.000 n=8)
Log-4 294.1n ± 0% 255.2n ± 3% -13.22% (p=0.000 n=8)
Logb-4 105.45n ± 1% 66.81n ± 1% -36.64% (p=0.000 n=8)
Log1p-4 268.2n ± 0% 211.3n ± 0% -21.21% (p=0.000 n=8)
Log10-4 295.4n ± 0% 255.2n ± 2% -13.59% (p=0.000 n=8)
Log2-4 152.9n ± 1% 127.5n ± 0% -16.61% (p=0.000 n=8)
Modf-4 103.40n ± 0% 75.36n ± 0% -27.12% (p=0.000 n=8)
Nextafter32-4 121.20n ± 1% 78.40n ± 0% -35.31% (p=0.000 n=8)
Nextafter64-4 110.40n ± 1% 64.91n ± 0% -41.20% (p=0.000 n=8)
PowInt-4 509.8n ± 1% 369.3n ± 1% -27.56% (p=0.000 n=8)
PowFrac-4 1189.0n ± 0% 947.8n ± 0% -20.29% (p=0.000 n=8)
Pow10Pos-4 15.07n ± 0% 15.07n ± 0% ~ (p=0.733 n=8)
Pow10Neg-4 20.10n ± 0% 20.10n ± 0% ~ (p=0.576 n=8)
Round-4 44.22n ± 0% 26.12n ± 0% -40.92% (p=0.000 n=8)
RoundToEven-4 46.22n ± 0% 27.12n ± 0% -41.31% (p=0.000 n=8)
Remainder-4 539.0n ± 1% 417.1n ± 1% -22.62% (p=0.000 n=8)
Signbit-4 17.985n ± 0% 5.694n ± 0% -68.34% (p=0.000 n=8)
Sin-4 185.7n ± 5% 172.9n ± 0% -6.89% (p=0.001 n=8)
Sincos-4 176.6n ± 0% 200.9n ± 0% +13.76% (p=0.000 n=8)
Sinh-4 495.8n ± 0% 435.9n ± 0% -12.09% (p=0.000 n=8)
SqrtIndirect-4 5.022n ± 0% 5.024n ± 0% ~ (p=0.083 n=8)
SqrtLatency-4 8.038n ± 0% 8.044n ± 0% ~ (p=0.524 n=8)
SqrtIndirectLatency-4 8.035n ± 0% 8.039n ± 0% +0.06% (p=0.017 n=8)
SqrtGoLatency-4 340.1n ± 0% 278.3n ± 0% -18.19% (p=0.000 n=8)
SqrtPrime-4 5.381µ ± 0% 5.386µ ± 0% ~ (p=0.662 n=8)
Tan-4 198.6n ± 1% 183.1n ± 0% -7.85% (p=0.000 n=8)
Tanh-4 491.3n ± 1% 440.8n ± 1% -10.29% (p=0.000 n=8)
Trunc-4 121.7n ± 0% 121.7n ± 0% ~ (p=0.769 n=8)
Y0-4 855.1n ± 0% 859.8n ± 0% +0.54% (p=0.007 n=8)
Y1-4 862.3n ± 0% 865.1n ± 0% +0.32% (p=0.007 n=8)
Yn-4 1.830µ ± 0% 1.837µ ± 0% +0.36% (p=0.011 n=8)
Float64bits-4 13.060n ± 0% 3.016n ± 0% -76.91% (p=0.000 n=8)
Float64frombits-4 13.060n ± 0% 3.018n ± 0% -76.90% (p=0.000 n=8)
Float32bits-4 13.060n ± 0% 3.016n ± 0% -76.91% (p=0.000 n=8)
Float32frombits-4 13.070n ± 0% 3.013n ± 0% -76.94% (p=0.000 n=8)
FMA-4 446.0n ± 0% 413.1n ± 1% -7.38% (p=0.000 n=8)
geomean 143.4n 108.3n -24.49%
Change-Id: I2067f7a5ae1126ada7ab3fb2083710e8212535e9
Reviewed-on: https://go-review.googlesource.com/c/go/+/493815
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
75add1ce0e
cmd/compile: intrinsify runtime/internal/atomic.{And,Or} on MIPS64x
This CL intrinsifies atomic.{And,Or} on mips64x, which are already intrinsified on mipsx.
goos: linux
goarch: mips64le
pkg: runtime/internal/atomic
│ oldatomic │ newatomic │
│ sec/op │ sec/op vs base │
AtomicLoad64-4 27.96n ± 0% 28.02n ± 0% +0.20% (p=0.026 n=8)
AtomicStore64-4 29.14n ± 0% 29.21n ± 0% +0.22% (p=0.004 n=8)
AtomicLoad-4 27.96n ± 0% 28.02n ± 0% ~ (p=0.220 n=8)
AtomicStore-4 29.15n ± 0% 29.21n ± 0% +0.19% (p=0.002 n=8)
And8-4 53.09n ± 0% 41.71n ± 0% -21.44% (p=0.000 n=8)
And-4 49.87n ± 0% 39.93n ± 0% -19.93% (p=0.000 n=8)
And8Parallel-4 70.45n ± 0% 68.58n ± 0% -2.65% (p=0.000 n=8)
AndParallel-4 70.40n ± 0% 67.95n ± 0% -3.47% (p=0.000 n=8)
Or8-4 52.09n ± 0% 41.11n ± 0% -21.08% (p=0.000 n=8)
Or-4 49.80n ± 0% 39.87n ± 0% -19.93% (p=0.000 n=8)
Or8Parallel-4 70.43n ± 0% 68.25n ± 0% -3.08% (p=0.000 n=8)
OrParallel-4 70.42n ± 0% 67.94n ± 0% -3.51% (p=0.000 n=8)
Xadd-4 67.83n ± 0% 67.92n ± 0% +0.13% (p=0.003 n=8)
Xadd64-4 67.85n ± 0% 67.92n ± 0% +0.09% (p=0.021 n=8)
Cas-4 81.34n ± 0% 81.37n ± 0% ~ (p=0.859 n=8)
Cas64-4 81.43n ± 0% 81.53n ± 0% +0.13% (p=0.001 n=8)
Xchg-4 67.15n ± 0% 67.18n ± 0% ~ (p=0.367 n=8)
Xchg64-4 67.16n ± 0% 67.21n ± 0% +0.08% (p=0.008 n=8)
geomean 54.04n 51.01n -5.61%
Change-Id: I9a4353f4b14134f1e9cf0dcf99db3feb951328ed
Reviewed-on: https://go-review.googlesource.com/c/go/+/494875
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Junxian Zhu <zhujunxian@oss.cipunited.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
5cad8d41ca
math: optimize math.Abs on mipsx
This commit optimizes the math.Abs implementation on mipsx.
Tested on a Loongson 3A2000.
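A sketch of the portable shape of math.Abs (clearing the IEEE 754 sign bit); based on the commit text, the mipsx win comes from doing this without the float-to-integer register moves:

```go
package main

import (
	"fmt"
	"math"
)

// abs mirrors the portable math.Abs implementation: clear the sign bit
// of the float64 bit pattern.
func abs(x float64) float64 {
	return math.Float64frombits(math.Float64bits(x) &^ (1 << 63))
}

func main() {
	fmt.Println(abs(-2.5), abs(2.5)) // 2.5 2.5
	fmt.Println(abs(math.Inf(-1)))   // +Inf
}
```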
goos: linux
goarch: mipsle
pkg: math
│ oldmath │ newmath │
│ sec/op │ sec/op vs base │
Acos-4 282.6n ± 0% 282.3n ± 0% ~ (p=0.140 n=7)
Acosh-4 506.1n ± 0% 451.8n ± 0% -10.73% (p=0.001 n=7)
Asin-4 272.3n ± 0% 272.2n ± 0% ~ (p=0.808 n=7)
Asinh-4 529.7n ± 0% 475.3n ± 0% -10.27% (p=0.001 n=7)
Atan-4 208.2n ± 0% 207.9n ± 0% ~ (p=0.134 n=7)
Atanh-4 503.4n ± 1% 449.7n ± 0% -10.67% (p=0.001 n=7)
Atan2-4 310.5n ± 0% 310.5n ± 0% ~ (p=0.928 n=7)
Cbrt-4 359.3n ± 0% 358.8n ± 0% ~ (p=0.121 n=7)
Ceil-4 203.9n ± 0% 204.0n ± 0% ~ (p=0.600 n=7)
Compare-4 23.11n ± 0% 23.11n ± 0% ~ (p=0.702 n=7)
Compare32-4 19.09n ± 0% 19.12n ± 0% ~ (p=0.070 n=7)
Copysign-4 33.20n ± 0% 34.02n ± 0% +2.47% (p=0.001 n=7)
Cos-4 422.5n ± 0% 385.4n ± 1% -8.78% (p=0.001 n=7)
Cosh-4 628.0n ± 0% 545.5n ± 0% -13.14% (p=0.001 n=7)
Erf-4 193.7n ± 2% 192.7n ± 1% ~ (p=0.430 n=7)
Erfc-4 192.8n ± 1% 193.0n ± 0% ~ (p=0.245 n=7)
Erfinv-4 220.7n ± 1% 221.5n ± 2% ~ (p=0.272 n=7)
Erfcinv-4 221.3n ± 1% 220.4n ± 2% ~ (p=0.738 n=7)
Exp-4 471.4n ± 0% 435.1n ± 0% -7.70% (p=0.001 n=7)
ExpGo-4 470.6n ± 0% 434.0n ± 0% -7.78% (p=0.001 n=7)
Expm1-4 243.1n ± 0% 243.4n ± 0% ~ (p=0.417 n=7)
Exp2-4 463.1n ± 0% 427.0n ± 0% -7.80% (p=0.001 n=7)
Exp2Go-4 462.4n ± 0% 426.2n ± 5% -7.83% (p=0.001 n=7)
Abs-4 37.000n ± 0% 8.039n ± 9% -78.27% (p=0.001 n=7)
Dim-4 18.09n ± 0% 18.11n ± 0% ~ (p=0.094 n=7)
Floor-4 151.9n ± 0% 151.8n ± 0% ~ (p=0.190 n=7)
Max-4 116.7n ± 1% 116.7n ± 1% ~ (p=0.842 n=7)
Min-4 116.6n ± 1% 116.6n ± 0% ~ (p=0.464 n=7)
Mod-4 1244.0n ± 0% 980.9n ± 0% -21.15% (p=0.001 n=7)
Frexp-4 199.0n ± 0% 146.7n ± 0% -26.28% (p=0.001 n=7)
Gamma-4 516.4n ± 0% 479.3n ± 1% -7.18% (p=0.001 n=7)
Hypot-4 169.8n ± 0% 117.8n ± 2% -30.62% (p=0.001 n=7)
HypotGo-4 170.8n ± 0% 117.5n ± 0% -31.21% (p=0.001 n=7)
Ilogb-4 160.8n ± 0% 109.5n ± 0% -31.90% (p=0.001 n=7)
J0-4 1.359µ ± 0% 1.305µ ± 0% -3.97% (p=0.001 n=7)
J1-4 1.386µ ± 0% 1.334µ ± 0% -3.75% (p=0.001 n=7)
Jn-4 2.864µ ± 0% 2.758µ ± 0% -3.70% (p=0.001 n=7)
Ldexp-4 202.9n ± 0% 151.7n ± 0% -25.23% (p=0.001 n=7)
Lgamma-4 234.0n ± 0% 234.3n ± 0% ~ (p=0.199 n=7)
Log-4 444.1n ± 0% 407.9n ± 0% -8.15% (p=0.001 n=7)
Logb-4 157.8n ± 0% 121.6n ± 0% -22.94% (p=0.001 n=7)
Log1p-4 354.8n ± 0% 315.4n ± 0% -11.10% (p=0.001 n=7)
Log10-4 453.9n ± 0% 417.9n ± 0% -7.93% (p=0.001 n=7)
Log2-4 245.3n ± 0% 209.1n ± 0% -14.76% (p=0.001 n=7)
Modf-4 126.6n ± 0% 126.6n ± 0% ~ (p=0.126 n=7)
Nextafter32-4 112.5n ± 0% 112.5n ± 0% ~ (p=0.853 n=7)
Nextafter64-4 141.7n ± 0% 141.6n ± 0% ~ (p=0.331 n=7)
PowInt-4 878.8n ± 1% 758.3n ± 1% -13.71% (p=0.001 n=7)
PowFrac-4 1.809µ ± 0% 1.615µ ± 0% -10.72% (p=0.001 n=7)
Pow10Pos-4 18.10n ± 0% 18.12n ± 0% ~ (p=0.464 n=7)
Pow10Neg-4 17.09n ± 0% 17.09n ± 0% ~ (p=0.263 n=7)
Round-4 68.36n ± 0% 68.33n ± 0% ~ (p=0.325 n=7)
RoundToEven-4 78.40n ± 0% 78.40n ± 0% ~ (p=0.934 n=7)
Remainder-4 894.0n ± 1% 753.4n ± 1% -15.73% (p=0.001 n=7)
Signbit-4 18.09n ± 0% 18.09n ± 0% ~ (p=0.761 n=7)
Sin-4 389.8n ± 1% 389.8n ± 0% ~ (p=0.995 n=7)
Sincos-4 416.0n ± 0% 415.9n ± 0% ~ (p=0.361 n=7)
Sinh-4 634.6n ± 4% 585.6n ± 1% -7.72% (p=0.001 n=7)
SqrtIndirect-4 8.035n ± 0% 8.036n ± 0% ~ (p=0.523 n=7)
SqrtLatency-4 8.039n ± 0% 8.037n ± 0% ~ (p=0.218 n=7)
SqrtIndirectLatency-4 8.040n ± 0% 8.040n ± 0% ~ (p=0.652 n=7)
SqrtGoLatency-4 895.7n ± 0% 896.6n ± 0% +0.10% (p=0.004 n=7)
SqrtPrime-4 5.406µ ± 0% 5.407µ ± 0% ~ (p=0.592 n=7)
Tan-4 406.1n ± 0% 405.8n ± 1% ~ (p=0.435 n=7)
Tanh-4 627.6n ± 0% 545.5n ± 0% -13.08% (p=0.001 n=7)
Trunc-4 146.7n ± 1% 146.7n ± 0% ~ (p=0.755 n=7)
Y0-4 1.359µ ± 0% 1.310µ ± 0% -3.61% (p=0.001 n=7)
Y1-4 1.351µ ± 0% 1.301µ ± 0% -3.70% (p=0.001 n=7)
Yn-4 2.829µ ± 0% 2.729µ ± 0% -3.53% (p=0.001 n=7)
Float64bits-4 14.08n ± 0% 14.07n ± 0% ~ (p=0.069 n=7)
Float64frombits-4 19.09n ± 0% 19.10n ± 0% ~ (p=0.755 n=7)
Float32bits-4 13.06n ± 0% 13.07n ± 1% ~ (p=0.586 n=7)
Float32frombits-4 13.06n ± 0% 13.06n ± 0% ~ (p=0.853 n=7)
FMA-4 606.9n ± 0% 606.8n ± 0% ~ (p=0.393 n=7)
geomean 201.1n 185.4n -7.81%
Change-Id: I6d41a97ad3789ed5731588588859ac0b8b13b664
Reviewed-on: https://go-review.googlesource.com/c/go/+/484675
Reviewed-by: Rong Zhang <rongrong@oss.cipunited.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
574431cfcd
math: optimize math.Abs on mips64x
This commit optimizes the math.Abs implementation on mips64x.
Tested on a Loongson 3A2000.
goos: linux
goarch: mips64le
pkg: math
│ oldmath │ newmath │
│ sec/op │ sec/op vs base │
Acos-4 258.0n ± ∞ ¹ 257.1n ± ∞ ¹ -0.35% (p=0.008 n=5)
Acosh-4 417.0n ± ∞ ¹ 377.9n ± ∞ ¹ -9.38% (p=0.008 n=5)
Asin-4 248.0n ± ∞ ¹ 259.9n ± ∞ ¹ +4.80% (p=0.008 n=5)
Asinh-4 439.6n ± ∞ ¹ 408.3n ± ∞ ¹ -7.12% (p=0.008 n=5)
Atan-4 189.6n ± ∞ ¹ 188.8n ± ∞ ¹ ~ (p=0.056 n=5)
Atanh-4 390.0n ± ∞ ¹ 356.4n ± ∞ ¹ -8.62% (p=0.008 n=5)
Atan2-4 279.0n ± ∞ ¹ 263.9n ± ∞ ¹ -5.41% (p=0.008 n=5)
Cbrt-4 314.2n ± ∞ ¹ 322.3n ± ∞ ¹ +2.58% (p=0.008 n=5)
Ceil-4 139.7n ± ∞ ¹ 136.6n ± ∞ ¹ -2.22% (p=0.008 n=5)
Compare-4 21.11n ± ∞ ¹ 21.09n ± ∞ ¹ ~ (p=0.405 n=5)
Compare32-4 20.10n ± ∞ ¹ 20.12n ± ∞ ¹ ~ (p=0.206 n=5)
Copysign-4 32.17n ± ∞ ¹ 35.71n ± ∞ ¹ +11.00% (p=0.008 n=5)
Cos-4 222.8n ± ∞ ¹ 169.8n ± ∞ ¹ -23.79% (p=0.008 n=5)
Cosh-4 550.2n ± ∞ ¹ 477.4n ± ∞ ¹ -13.23% (p=0.008 n=5)
Erf-4 171.6n ± ∞ ¹ 174.5n ± ∞ ¹ ~ (p=0.635 n=5)
Erfc-4 182.6n ± ∞ ¹ 170.2n ± ∞ ¹ -6.79% (p=0.008 n=5)
Erfinv-4 177.6n ± ∞ ¹ 196.6n ± ∞ ¹ +10.70% (p=0.008 n=5)
Erfcinv-4 177.8n ± ∞ ¹ 197.8n ± ∞ ¹ +11.25% (p=0.008 n=5)
Exp-4 422.8n ± ∞ ¹ 382.1n ± ∞ ¹ -9.63% (p=0.008 n=5)
ExpGo-4 416.1n ± ∞ ¹ 383.2n ± ∞ ¹ -7.91% (p=0.008 n=5)
Expm1-4 232.9n ± ∞ ¹ 252.2n ± ∞ ¹ +8.29% (p=0.008 n=5)
Exp2-4 404.8n ± ∞ ¹ 389.1n ± ∞ ¹ -3.88% (p=0.008 n=5)
Exp2Go-4 407.0n ± ∞ ¹ 372.3n ± ∞ ¹ -8.53% (p=0.008 n=5)
Abs-4 30.120n ± ∞ ¹ 3.014n ± ∞ ¹ -89.99% (p=0.008 n=5)
Dim-4 5.021n ± ∞ ¹ 5.023n ± ∞ ¹ ~ (p=0.071 n=5)
Floor-4 127.8n ± ∞ ¹ 127.1n ± ∞ ¹ -0.55% (p=0.008 n=5)
Max-4 77.69n ± ∞ ¹ 76.33n ± ∞ ¹ -1.75% (p=0.008 n=5)
Min-4 83.27n ± ∞ ¹ 77.87n ± ∞ ¹ -6.48% (p=0.008 n=5)
Mod-4 906.2n ± ∞ ¹ 692.9n ± ∞ ¹ -23.54% (p=0.008 n=5)
Frexp-4 150.6n ± ∞ ¹ 108.6n ± ∞ ¹ -27.89% (p=0.008 n=5)
Gamma-4 418.4n ± ∞ ¹ 386.1n ± ∞ ¹ -7.72% (p=0.008 n=5)
Hypot-4 148.20n ± ∞ ¹ 93.78n ± ∞ ¹ -36.72% (p=0.008 n=5)
HypotGo-4 148.20n ± ∞ ¹ 94.47n ± ∞ ¹ -36.26% (p=0.008 n=5)
Ilogb-4 135.50n ± ∞ ¹ 92.38n ± ∞ ¹ -31.82% (p=0.008 n=5)
J0-4 937.7n ± ∞ ¹ 861.7n ± ∞ ¹ -8.10% (p=0.008 n=5)
J1-4 915.4n ± ∞ ¹ 875.9n ± ∞ ¹ -4.32% (p=0.008 n=5)
Jn-4 1.974µ ± ∞ ¹ 1.863µ ± ∞ ¹ -5.62% (p=0.008 n=5)
Ldexp-4 158.5n ± ∞ ¹ 129.3n ± ∞ ¹ -18.42% (p=0.008 n=5)
Lgamma-4 209.0n ± ∞ ¹ 211.8n ± ∞ ¹ ~ (p=0.095 n=5)
Log-4 326.4n ± ∞ ¹ 295.2n ± ∞ ¹ -9.56% (p=0.008 n=5)
Logb-4 147.7n ± ∞ ¹ 105.0n ± ∞ ¹ -28.91% (p=0.008 n=5)
Log1p-4 303.4n ± ∞ ¹ 266.3n ± ∞ ¹ -12.23% (p=0.008 n=5)
Log10-4 329.2n ± ∞ ¹ 298.3n ± ∞ ¹ -9.39% (p=0.008 n=5)
Log2-4 187.4n ± ∞ ¹ 153.0n ± ∞ ¹ -18.36% (p=0.008 n=5)
Modf-4 110.5n ± ∞ ¹ 103.5n ± ∞ ¹ -6.33% (p=0.008 n=5)
Nextafter32-4 128.4n ± ∞ ¹ 121.5n ± ∞ ¹ -5.37% (p=0.016 n=5)
Nextafter64-4 109.5n ± ∞ ¹ 110.5n ± ∞ ¹ +0.91% (p=0.008 n=5)
PowInt-4 603.3n ± ∞ ¹ 516.4n ± ∞ ¹ -14.40% (p=0.008 n=5)
PowFrac-4 1.365µ ± ∞ ¹ 1.183µ ± ∞ ¹ -13.33% (p=0.008 n=5)
Pow10Pos-4 15.07n ± ∞ ¹ 15.07n ± ∞ ¹ ~ (p=0.738 n=5)
Pow10Neg-4 21.11n ± ∞ ¹ 21.10n ± ∞ ¹ ~ (p=0.190 n=5)
Round-4 44.23n ± ∞ ¹ 44.22n ± ∞ ¹ ~ (p=0.635 n=5)
RoundToEven-4 50.25n ± ∞ ¹ 46.27n ± ∞ ¹ -7.92% (p=0.008 n=5)
Remainder-4 675.6n ± ∞ ¹ 530.4n ± ∞ ¹ -21.49% (p=0.008 n=5)
Signbit-4 17.07n ± ∞ ¹ 17.95n ± ∞ ¹ +5.16% (p=0.008 n=5)
Sin-4 171.6n ± ∞ ¹ 189.1n ± ∞ ¹ +10.20% (p=0.008 n=5)
Sincos-4 201.5n ± ∞ ¹ 200.5n ± ∞ ¹ ~ (p=0.421 n=5)
Sinh-4 529.6n ± ∞ ¹ 484.6n ± ∞ ¹ -8.50% (p=0.008 n=5)
SqrtIndirect-4 5.021n ± ∞ ¹ 5.023n ± ∞ ¹ +0.04% (p=0.048 n=5)
SqrtLatency-4 8.032n ± ∞ ¹ 8.039n ± ∞ ¹ +0.09% (p=0.024 n=5)
SqrtIndirectLatency-4 8.036n ± ∞ ¹ 8.038n ± ∞ ¹ ~ (p=0.056 n=5)
SqrtGoLatency-4 338.8n ± ∞ ¹ 338.7n ± ∞ ¹ ~ (p=0.841 n=5)
SqrtPrime-4 5.379µ ± ∞ ¹ 5.382µ ± ∞ ¹ +0.06% (p=0.048 n=5)
Tan-4 182.7n ± ∞ ¹ 191.8n ± ∞ ¹ +4.98% (p=0.008 n=5)
Tanh-4 558.7n ± ∞ ¹ 497.6n ± ∞ ¹ -10.94% (p=0.008 n=5)
Trunc-4 122.5n ± ∞ ¹ 122.6n ± ∞ ¹ ~ (p=0.405 n=5)
Y0-4 892.8n ± ∞ ¹ 851.7n ± ∞ ¹ -4.60% (p=0.008 n=5)
Y1-4 887.2n ± ∞ ¹ 863.2n ± ∞ ¹ -2.71% (p=0.008 n=5)
Yn-4 1.889µ ± ∞ ¹ 1.832µ ± ∞ ¹ -3.02% (p=0.008 n=5)
Float64bits-4 13.05n ± ∞ ¹ 13.06n ± ∞ ¹ +0.08% (p=0.040 n=5)
Float64frombits-4 13.05n ± ∞ ¹ 13.06n ± ∞ ¹ ~ (p=0.143 n=5)
Float32bits-4 13.05n ± ∞ ¹ 13.06n ± ∞ ¹ +0.08% (p=0.008 n=5)
Float32frombits-4 13.05n ± ∞ ¹ 13.08n ± ∞ ¹ +0.23% (p=0.016 n=5)
FMA-4 445.7n ± ∞ ¹ 448.1n ± ∞ ¹ +0.54% (p=0.008 n=5)
geomean 157.2n 142.8n -9.17%
Change-Id: I9bf104848b588c9ecf79401a81d483d7fcdb0a79
Reviewed-on: https://go-review.googlesource.com/c/go/+/481575
Reviewed-by: M Zhuo <mzh@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Than McIntosh <thanm@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
Reviewed-by: Rong Zhang <rongrong@oss.cipunited.com>
cedf5008a8
cmd/compile: introduce separate memory op combining pass
Memory op combining is currently done using arch-specific rewrite rules. Instead, do them as an arch-independent rewrite pass. This ensures that all architectures (with unaligned loads & stores) get equal treatment. This removes a lot of rewrite rules. The new pass is a bit more comprehensive. It handles things like out-of-order writes and is careful not to apply partial optimizations that then block further optimizations.
Change-Id: I780ff3bb052475cd725a923309616882d25b8d9e
Reviewed-on: https://go-review.googlesource.com/c/go/+/478475
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
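The canonical pattern such a pass targets, as a sketch (byte-wise loads that can be fused into one wide load where unaligned access is allowed):

```go
package main

import "fmt"

// load64 reads a little-endian uint64 one byte at a time. A memory op
// combining pass can fuse the eight loads, shifts and ORs into a single
// 8-byte load on architectures with unaligned loads.
func load64(b []byte) uint64 {
	_ = b[7] // bounds check hint
	return uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
		uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
		uint64(b[6])<<48 | uint64(b[7])<<56
}

func main() {
	fmt.Printf("%#x\n", load64([]byte{1, 2, 3, 4, 5, 6, 7, 8})) // 0x807060504030201
}
```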
96428e160d
cmd/compile: split DIVV/DIVVU op on loong64
Previously, we calculated both the quotient and the remainder together. However, in most cases only one result is needed. By separating these instructions, we can save one instruction in most cases.
Change-Id: I0a2d4167cda68ab606783ba1aa2720ede19d6b53
Reviewed-on: https://go-review.googlesource.com/c/go/+/475315
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
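In Go terms, a sketch of why the split pays off (each statement needs only one of the two results):

```go
package main

import "fmt"

func divmod(a, b int64) (int64, int64) {
	q := a / b // needs only the quotient
	r := a % b // needs only the remainder
	return q, r
}

func main() {
	fmt.Println(divmod(17, 5)) // 3 2
}
```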
42f99b203d
cmd/compile: optimize cmp to cmn under conditions < and >= on arm64
Under the right conditions we can optimize cmp comparisons to cmn
comparisons, such as:
func foo(a, b int) int {
var c int
if a + b < 0 {
c = 1
}
return c
}
Previously it was compiled as:
ADD R1, R0, R1
CMP $0, R1
CSET LT, R0
With this CL it's compiled as:
CMN R1, R0
CSET MI, R0
Here we need to pay attention to overflow of a+b: the MI
flag means N==1, which doesn't honor the overflow flag V; its value
depends only on the sign of the result. So this matches the semantics
of the Go code, and the transformation is correct.
Similarly, this CL also optimizes the case of >= comparison
using the PL conditional flag.
Change-Id: I47179faba5b30cca84ea69bafa2ad5241bf6dfba
Reviewed-on: https://go-review.googlesource.com/c/go/+/476116
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
3360be4a11
cmd/compile: fix extraneous diff in generated files
Looks like CL 475735 contained a not-quite-up-to-date version of the generated file. Maybe ABSFL was in an earlier version of the CL and was removed before check-in without regenerating the generated file? In any case, update the generated file. This shouldn't cause a problem, as that field isn't used in x86/ssa.go.
Change-Id: I3f0b7d41081ba3ce2cdcae385fea16b37d7de81b
Reviewed-on: https://go-review.googlesource.com/c/go/+/477096
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
cedfcba3e8
cmd/compile: intrinsify TrailingZeros{8,32,64} for 386
This CL adds support for intrinsifying the TrailingZeros{8,32,64}
functions on the 386 architecture. We need to handle the case where
the input is 0, which could lead to undefined output from the BSFL
instruction. The next CL will remove the assembly code in the
runtime/internal/sys package.
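A sketch of the zero-input case the intrinsic must guard: the Go results are fixed by the math/bits documentation, while BSFL's output for a zero input is undefined:

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	fmt.Println(bits.TrailingZeros32(0))        // 32, fully defined in Go
	fmt.Println(bits.TrailingZeros32(0b101000)) // 3
	fmt.Println(bits.TrailingZeros8(0))         // 8
}
```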
Change-Id: Ic168edf68e81bf69a536102100fdd3f56f0f4a1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/475735
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
14015be5bb
cmd/compile: optimize multiplication on loong64
Previously, multiplication on the loong64 architecture was performed using MULV and MULHVU instructions to calculate the low 64 bits and high 64 bits of a multiplication respectively. However, in most cases only the low 64 bits are needed. This commit enables computing only the low 64-bit result with the MULV instruction, reducing the binary size slightly.
file before after Δ %
addr2line 2833777 2833849 +72 +0.003%
asm 5267499 5266963 -536 -0.010%
buildid 2579706 2579402 -304 -0.012%
cgo 4798260 4797444 -816 -0.017%
compile 25247419 25175030 -72389 -0.287%
cover 4973091 4972027 -1064 -0.021%
dist 3631013 3565653 -65360 -1.800%
doc 4076036 4074004 -2032 -0.050%
fix 3496378 3496066 -312 -0.009%
link 6984102 6983214 -888 -0.013%
nm 2743820 2743516 -304 -0.011%
objdump 4277171 4277035 -136 -0.003%
pack 2379248 2378872 -376 -0.016%
pprof 14419090 14419874 +784 +0.005%
test2json 2684386 2684018 -368 -0.014%
trace 13640018 13631034 -8984 -0.066%
vet 7748918 7752630 +3712 +0.048%
go 15643850 15638098 -5752 -0.037%
total 127423782 127268729 -155053 -0.122%
Change-Id: Ifce4a9a3ed1d03c170681e39cb6f3541db9882dc
Reviewed-on: https://go-review.googlesource.com/c/go/+/472775
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
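In Go terms, a sketch of the distinction (a plain product needs only the low half, while bits.Mul64 genuinely needs both):

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	a, b := uint64(1<<40), uint64(1<<30)

	lo := a * b            // low 64 bits only: a single MULV suffices after this change
	fmt.Printf("%#x\n", lo) // 0x0 (the product overflows 64 bits)

	hi, lo2 := bits.Mul64(a, b)      // both halves: still MULHVU + MULV
	fmt.Printf("%#x %#x\n", hi, lo2) // 0x40 0x0
}
```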
21d82e6ac8
cmd/compile: batch write barrier calls
Have the write barrier call return a pointer to a buffer into which the generated code records pointers that need write barrier treatment.
Change-Id: I7871764298e0aa1513de417010c8d46b296b199e
Reviewed-on: https://go-review.googlesource.com/c/go/+/447781
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Bypass: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
44d22e75dd
cmd/compile: detect write barrier completion differently
Instead of keeping track of in which blocks write barriers complete, introduce a new op that marks the exact memory state where the write barrier completes. For future use. This allows us to move some of the write barrier code to between the start of the merging block and the WBend marker. Change-Id: If3809b260292667d91bf0ee18d7b4d0eb1e929f0 Reviewed-on: https://go-review.googlesource.com/c/go/+/447777 Reviewed-by: Keith Randall <khr@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Run-TryBot: Keith Randall <khr@golang.org> |
|
|
|
f9da938614 |
cmd/compile: remove unused ISELB PPC64 ssa opcode
The usage of ISELB has been removed as part of changes made to support Power10 SETBC instructions. Change-Id: I2fce4370f48c1eeee65d411dfd1bea4201f45b45 Reviewed-on: https://go-review.googlesource.com/c/go/+/465575 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Paul Murphy <murp@ibm.com> Reviewed-by: Archana Ravindar <aravind5@in.ibm.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> |
|
|
|
a432d89137 |
cmd/compile: add rules to emit SETBC/R instructions on power10
This CL adds rules that replace instances of ISEL that produce a boolean result based on a condition register with SETBC/SETBCR operations. On Power10 these are converted to SETBC/SETBCR instructions that use one register instead of the three registers conventionally used by ISEL, which reduces register pressure. On loops written specifically to exercise such instances of ISEL extensively, a performance improvement of 2.5% is seen on Power10. Also added verification tests to verify correct generation of SETBC/SETBCR instructions on Power10. Change-Id: Ib719897f09d893de40324440a43052dca026e8fa Reviewed-on: https://go-review.googlesource.com/c/go/+/449795 Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Run-TryBot: Archana Ravindar <aravind5@in.ibm.com> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> TryBot-Result: Gopher Robot <gobot@golang.org> |
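The affected source pattern is any boolean materialized directly from a condition. A small illustration (ordinary Go; the function names are just examples, and nothing in the source is Power10-specific):

```go
package p

// Each function returns a boolean computed from a condition register.
// Pre-Power10 this lowers to ISEL selecting between 0 and 1 (three
// registers); on Power10 a single SETBC/SETBCR suffices.
func lessThan(a, b int64) bool { return a < b }

func nonZero(x uint64) bool { return x != 0 }
```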
|
|
|
cd1fc87156 |
cmd/compile: intrinsify math/bits.ReverseBytes{16|32|64} for ppc64/power10
This change intrinsifies ReverseBytes{16|32|64} by generating the
corresponding new instructions in Power10: brh, brd and brw and
adds a verification test for the same.
On Power 9 and 8, the .go code performs optimally as it is.
Performance improvement seen on Power10:
ReverseBytes32 1.38ns ± 0% 1.18ns ± 0% -14.2%
ReverseBytes64 1.52ns ± 0% 1.11ns ± 0% -26.87%
ReverseBytes16 1.41ns ± 1% 1.18ns ± 0% -16.47%
Change-Id: I88f127f3ab9ba24a772becc21ad90acfba324b37
Reviewed-on: https://go-review.googlesource.com/c/go/+/446675
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
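The functions being intrinsified are the standard math/bits byte-reversal helpers; for reference:

```go
package main

import (
	"fmt"
	"math/bits"
)

func main() {
	fmt.Printf("%#x\n", bits.ReverseBytes16(0x1234))             // 0x3412 (brh on Power10)
	fmt.Printf("%#x\n", bits.ReverseBytes32(0x12345678))         // 0x78563412 (brw)
	fmt.Printf("%#x\n", bits.ReverseBytes64(0x0123456789abcdef)) // 0xefcdab8967452301 (brd)
}
```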
|
|
|
|
5c67ebbb31 |
cmd/compile: AMD64v3 remove unnecessary TEST comparison in isPowerOfTwo
With GOAMD64=V3 the canonical isPowerOfTwo function:
func isPowerOfTwo(x uintptr) bool {
return x&(x-1) == 0
}
used to compile to:
temp := BLSR(x) // x&(x-1)
flags = TEST(temp, temp)
return flags.zf
However, the BLSR instruction already sets ZF according to the result,
so we can remove the TEST instruction when we are only checking ZF,
such as in multiple pieces of code around memory allocations.
This makes the code smaller and faster.
Change-Id: Ia12d5a73aa3cb49188c0b647b1eff7b56c5a7b58
Reviewed-on: https://go-review.googlesource.com/c/go/+/448255
Run-TryBot: Jakub Ciolek <jakub@ciolek.dev>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
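Changes like this are typically pinned down with the compiler's asm-matching tests under test/codegen; a hedged sketch of what such a check could look like (the directive syntax is assumed from that suite's conventions, so treat it as illustrative rather than the test actually added):

```go
package codegen

// With GOAMD64=v3, expect BLSRQ with no separate TESTQ.
func isPowerOfTwo(x uint64) bool {
	// amd64/v3:"BLSRQ",-"TESTQ"
	return x&(x-1) == 0
}
```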
|
|
|
|
12befc3ce3 |
cmd/compile: improve scheduling pass
Convert the scheduling pass from scheduling backwards to scheduling forwards. Forward scheduling makes it easier to prioritize scheduling values as soon as they are ready, which is important for things like nil checks, select ops, etc. Forward scheduling is also quite a bit clearer. It was originally backwards because computing uses is tricky, but I found a way to do it simply and with n lg n complexity. The new scheme also makes it easy to add new scheduling edges if needed. Fixes #42673 Update #56568 Change-Id: Ibbb38c52d191f50ce7a94f8c1cbd3cd9b614ea8b Reviewed-on: https://go-review.googlesource.com/c/go/+/270940 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Keith Randall <khr@google.com> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: David Chase <drchase@google.com> |
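A minimal sketch of the forward list-scheduling shape described here, using invented value and priority types (the real pass orders SSA values within a block and tracks more edge kinds; this only illustrates the ready-queue idea of scheduling a value once all its arguments are scheduled, highest priority first):

```go
package main

import (
	"container/heap"
	"fmt"
)

type val struct {
	id       int
	priority int   // higher schedules sooner once ready (e.g. nil checks)
	nargs    int   // unscheduled arguments remaining
	uses     []int // ids of values that consume this one
}

type readyQueue []*val

func (q readyQueue) Len() int           { return len(q) }
func (q readyQueue) Less(i, j int) bool { return q[i].priority > q[j].priority }
func (q readyQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *readyQueue) Push(x any)        { *q = append(*q, x.(*val)) }
func (q *readyQueue) Pop() any {
	old := *q
	v := old[len(old)-1]
	*q = old[:len(old)-1]
	return v
}

func schedule(vals map[int]*val) []int {
	var q readyQueue
	for _, v := range vals {
		if v.nargs == 0 { // ready from the start
			heap.Push(&q, v)
		}
	}
	var order []int
	for q.Len() > 0 {
		v := heap.Pop(&q).(*val)
		order = append(order, v.id)
		for _, u := range v.uses { // a scheduled value may ready its users
			if vals[u].nargs--; vals[u].nargs == 0 {
				heap.Push(&q, vals[u])
			}
		}
	}
	return order
}

func main() {
	// Values 1 and 2 are ready immediately; 3 uses both; 2 has priority.
	vals := map[int]*val{
		1: {id: 1, priority: 0, uses: []int{3}},
		2: {id: 2, priority: 5, uses: []int{3}},
		3: {id: 3, nargs: 2},
	}
	fmt.Println(schedule(vals)) // [2 1 3]
}
```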
|
|
|
45dc81d856 |
cmd/compile: add memory argument to GetCallerSP
We need to make sure that when we get the stack pointer, we get it at the right time.

V = GetCallerSP
Call()
W = GetCallerSP

If Call causes a stack growth, then we will be in a situation where V != W. So it matters when GetCallerSP operations get scheduled. Add a memory argument to GetCallerSP so it can't be reordered with things like calls. Change-Id: I6cc801134c38e358c5a1ec0c09d38379a16a4184 Reviewed-on: https://go-review.googlesource.com/c/go/+/453515 Reviewed-by: Martin Möhrmann <moehrmann@google.com> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Martin Möhrmann <martin@golang.org> Reviewed-by: Robert Griesemer <gri@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> |
|
|
|
f959fb3872 |
cmd/compile: add anchored version of SP
The SPanchored opcode is identical to SP, except that it takes a memory argument so that it (and more importantly, anything that uses it) must be scheduled at or after that memory argument. This opcode ensures that a LEAQ of a variable gets scheduled after the corresponding VARDEF for that variable. This may lead to less CSE of LEAQ operations. The effect is very small. The go binary is only 80 bytes bigger after this CL. Usually LEAQs get folded into load/store operations, so the effect is only for pointerful types, large enough to need a duffzero, and have their address passed somewhere. Even then, usually the CSEd LEAQs will be un-CSEd because the two uses are on different sides of a function call and the LEAQ ends up being rematerialized at the second use anyway. Change-Id: Ib893562cd05369b91dd563b48fb83f5250950293 Reviewed-on: https://go-review.googlesource.com/c/go/+/452916 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com> Reviewed-by: Martin Möhrmann <martin@golang.org> Reviewed-by: Keith Randall <khr@google.com> |
|
|
|
47a0d46716 |
cmd/compile/internal/ssa: generate code via a //go:generate directive
The standard way to generate code in a Go package is via //go:generate directives, which are invoked by the developer explicitly running: go generate import/path/of/said/package Switch to using that approach here. This way, developers don't need to learn and remember a custom way that each particular Go package may choose to implement its code generation. It also enables conveniences such as 'go generate -n' to discover how code is generated without running anything (this works on all packages that rely on //go:generate directives), being able to generate multiple packages at once and from any directory, and so on. Change-Id: I0e5b6a1edeff670a8e588befeef0c445613803c7 Reviewed-on: https://go-review.googlesource.com/c/go/+/460135 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org> Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> |
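For reference, the general shape of the mechanism (the file name, command, and path below are illustrative, not necessarily the exact directive added to this package):

```go
// gen.go (illustrative)
package ssa

// 'go generate cmd/compile/internal/ssa' runs the command below from this
// package's directory; 'go generate -n' prints it without running it.
//go:generate go run ./_gen
```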
|
|
|
5f7abeca5a |
cmd/compile: teach regalloc about temporary registers
Temporary registers are sometimes needed for an architecture backend which needs to use several machine instructions to implement a single SSA instruction. Mark such instructions so that regalloc can reserve the temporary register for it. That way we don't have to reserve a fixed register like we do now. Convert the temp-register-using instructions on amd64 to use this new mechanism. Other archs can follow as needed. Change-Id: I1d0c8588afdad5cd18b4398eb5a0f755be5dead7 Reviewed-on: https://go-review.googlesource.com/c/go/+/398556 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: David Chase <drchase@google.com> |
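A toy illustration of the idea (names invented; the real allocator works on SSA values and register masks): instead of dedicating a fixed scratch register everywhere, pick any register that is not live across the one instruction that needs a temporary.

```go
package main

import "fmt"

// pickTemp returns a register that is free for the duration of a single
// instruction, given the set of registers live across it.
func pickTemp(all []string, liveAcross map[string]bool) (string, bool) {
	for _, r := range all {
		if !liveAcross[r] {
			return r, true
		}
	}
	return "", false // none free: the allocator would have to spill
}

func main() {
	regs := []string{"AX", "BX", "CX", "DX"}
	live := map[string]bool{"AX": true, "CX": true}
	fmt.Println(pickTemp(regs, live)) // BX true
}
```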