mirror/go - go - Git Fam. Sieh

Commit Graph

Author	SHA1	Message	Date
Austin Clements	1d20a362d0	math: avoid assembly stubs Currently almost all math functions have the following pattern: func Sin(x float64) float64 func sin(x float64) float64 { // ... pure Go implementation ... } Architectures that implement a function in assembly provide the assembly implementation directly as the exported function (e.g., Sin), and architectures that don't implement it in assembly use a small stub to jump back to the Go code, like: TEXT ·Sin(SB), NOSPLIT, $0 JMP ·sin(SB) However, most functions are not implemented in assembly on most architectures, so this jump through assembly is a waste. It defeats compiler optimizations like inlining. And, with regabi, it actually adds a small but non-trivial overhead because the jump from assembly back to Go must go through an ABI0->ABIInternal bridge function. Hence, this CL reorganizes this structure across the entire package. It now leans on inlining to achieve peak performance, but allows the compiler to see all the way through the pure Go implementation. Now, functions follow this pattern: func Sin(x float64) float64 { if haveArchSin { return archSin(x) } return sin(x) } func sin(x float64) float64 { // ... pure Go implementation ... } Architectures that have assembly implementations use build-tagged files to set haveArchX to true an provide an archX implementation. That implementation can also still call back into the Go implementation (some of them do this). Prior to this change, enabling ABI wrappers results in a geomean slowdown of the math benchmarks of 8.77% (full results: https://perf.golang.org/search?q=upload:20210415.6) and of the Tile38 benchmarks by ~4%. After this change, enabling ABI wrappers is completely performance-neutral on Tile38 and all but one math benchmark (full results: https://perf.golang.org/search?q=upload:20210415.7). ABI wrappers slow down SqrtIndirectLatency-12 by 2.09%, which makes sense because that call must still go through an ABI wrapper. With ABI wrappers disabled (which won't be an option on amd64 much longer), on linux/amd64, this change is largely performance-neutral and slightly improves the performance of a few benchmarks: (Because there are so many benchmarks, I've applied the Šidák correction to the alpha threshold. It makes relatively little difference in which benchmarks are statistically significant.) name old time/op new time/op delta Acos-12 22.3ns ± 0% 18.8ns ± 1% -15.44% (p=0.000 n=18+16) Acosh-12 28.2ns ± 0% 28.2ns ± 0% ~ (p=0.404 n=18+20) Asin-12 18.1ns ± 0% 18.2ns ± 0% +0.20% (p=0.000 n=18+16) Asinh-12 32.8ns ± 0% 32.9ns ± 1% ~ (p=0.891 n=18+20) Atan-12 9.92ns ± 0% 9.90ns ± 1% -0.24% (p=0.000 n=17+16) Atanh-12 27.7ns ± 0% 27.5ns ± 0% -0.72% (p=0.000 n=16+20) Atan2-12 18.5ns ± 0% 18.4ns ± 0% -0.59% (p=0.000 n=19+19) Cbrt-12 22.1ns ± 0% 22.1ns ± 0% ~ (p=0.804 n=16+17) Ceil-12 0.84ns ± 0% 0.84ns ± 0% ~ (p=0.663 n=18+16) Copysign-12 0.84ns ± 0% 0.84ns ± 0% ~ (p=0.762 n=16+19) Cos-12 12.7ns ± 0% 12.7ns ± 1% ~ (p=0.145 n=19+18) Cosh-12 22.2ns ± 0% 22.5ns ± 0% +1.60% (p=0.000 n=17+19) Erf-12 11.1ns ± 1% 11.1ns ± 1% ~ (p=0.010 n=19+19) Erfc-12 12.6ns ± 1% 12.7ns ± 0% ~ (p=0.066 n=19+15) Erfinv-12 16.1ns ± 0% 16.1ns ± 0% ~ (p=0.462 n=17+20) Erfcinv-12 16.0ns ± 1% 16.0ns ± 1% ~ (p=0.015 n=17+16) Exp-12 16.3ns ± 0% 16.5ns ± 1% +1.25% (p=0.000 n=19+16) ExpGo-12 36.2ns ± 1% 36.1ns ± 1% ~ (p=0.242 n=20+18) Expm1-12 18.6ns ± 0% 18.7ns ± 0% +0.25% (p=0.000 n=16+19) Exp2-12 34.7ns ± 0% 34.6ns ± 1% ~ (p=0.010 n=19+18) Exp2Go-12 34.8ns ± 1% 34.8ns ± 1% ~ (p=0.372 n=19+19) Abs-12 0.56ns ± 0% 0.56ns ± 0% ~ (p=0.766 n=18+16) Dim-12 0.84ns ± 1% 0.84ns ± 1% ~ (p=0.167 n=17+19) Floor-12 0.84ns ± 0% 0.84ns ± 0% ~ (p=0.993 n=18+16) Max-12 3.35ns ± 0% 3.35ns ± 0% ~ (p=0.894 n=17+19) Min-12 3.35ns ± 0% 3.36ns ± 1% ~ (p=0.214 n=18+18) Mod-12 35.2ns ± 0% 34.7ns ± 0% -1.45% (p=0.000 n=18+17) Frexp-12 5.31ns ± 0% 4.75ns ± 0% -10.51% (p=0.000 n=19+18) Gamma-12 14.8ns ± 0% 16.2ns ± 1% +9.21% (p=0.000 n=20+19) Hypot-12 6.16ns ± 0% 6.17ns ± 0% +0.26% (p=0.000 n=19+20) HypotGo-12 7.79ns ± 1% 7.78ns ± 0% ~ (p=0.497 n=18+17) Ilogb-12 4.47ns ± 0% 4.47ns ± 0% ~ (p=0.167 n=19+19) J0-12 76.0ns ± 0% 76.3ns ± 0% +0.35% (p=0.000 n=19+18) J1-12 76.8ns ± 1% 75.9ns ± 0% -1.14% (p=0.000 n=18+18) Jn-12 167ns ± 1% 168ns ± 1% ~ (p=0.038 n=18+18) Ldexp-12 6.98ns ± 0% 6.43ns ± 0% -7.97% (p=0.000 n=17+18) Lgamma-12 15.9ns ± 0% 16.0ns ± 1% ~ (p=0.011 n=20+17) Log-12 13.3ns ± 0% 13.4ns ± 1% +0.37% (p=0.000 n=15+18) Logb-12 4.75ns ± 0% 4.75ns ± 0% ~ (p=0.831 n=16+18) Log1p-12 19.5ns ± 0% 19.5ns ± 1% ~ (p=0.851 n=18+17) Log10-12 15.9ns ± 0% 14.0ns ± 0% -11.92% (p=0.000 n=17+16) Log2-12 7.88ns ± 1% 8.01ns ± 0% +1.72% (p=0.000 n=20+20) Modf-12 4.75ns ± 0% 4.34ns ± 0% -8.66% (p=0.000 n=19+17) Nextafter32-12 5.31ns ± 0% 5.31ns ± 0% ~ (p=0.389 n=17+18) Nextafter64-12 5.03ns ± 1% 5.03ns ± 0% ~ (p=0.774 n=17+18) PowInt-12 29.9ns ± 0% 28.5ns ± 0% -4.69% (p=0.000 n=18+19) PowFrac-12 91.0ns ± 0% 91.1ns ± 0% ~ (p=0.029 n=19+19) Pow10Pos-12 1.12ns ± 0% 1.12ns ± 0% ~ (p=0.363 n=20+20) Pow10Neg-12 3.90ns ± 0% 3.90ns ± 0% ~ (p=0.921 n=17+18) Round-12 2.31ns ± 0% 2.31ns ± 1% ~ (p=0.390 n=18+18) RoundToEven-12 0.84ns ± 0% 0.84ns ± 0% ~ (p=0.280 n=18+19) Remainder-12 31.6ns ± 0% 29.6ns ± 0% -6.16% (p=0.000 n=18+17) Signbit-12 0.56ns ± 0% 0.56ns ± 0% ~ (p=0.385 n=19+18) Sin-12 12.5ns ± 0% 12.5ns ± 0% ~ (p=0.080 n=18+18) Sincos-12 16.4ns ± 2% 16.4ns ± 2% ~ (p=0.253 n=20+19) Sinh-12 26.1ns ± 0% 26.1ns ± 0% +0.18% (p=0.000 n=17+19) SqrtIndirect-12 3.91ns ± 0% 3.90ns ± 0% ~ (p=0.133 n=19+19) SqrtLatency-12 2.79ns ± 0% 2.79ns ± 0% ~ (p=0.226 n=16+19) SqrtIndirectLatency-12 6.68ns ± 0% 6.37ns ± 2% -4.66% (p=0.000 n=17+20) SqrtGoLatency-12 49.4ns ± 0% 49.4ns ± 0% ~ (p=0.289 n=18+16) SqrtPrime-12 3.18µs ± 0% 3.18µs ± 0% ~ (p=0.084 n=17+18) Tan-12 13.8ns ± 0% 13.9ns ± 2% ~ (p=0.292 n=19+20) Tanh-12 25.4ns ± 0% 25.4ns ± 0% ~ (p=0.101 n=17+17) Trunc-12 0.84ns ± 0% 0.84ns ± 0% ~ (p=0.765 n=18+16) Y0-12 75.8ns ± 0% 75.9ns ± 1% ~ (p=0.805 n=16+18) Y1-12 76.3ns ± 0% 75.3ns ± 1% -1.34% (p=0.000 n=19+17) Yn-12 164ns ± 0% 164ns ± 2% ~ (p=0.356 n=18+20) Float64bits-12 0.56ns ± 0% 0.56ns ± 0% ~ (p=0.383 n=18+18) Float64frombits-12 0.56ns ± 0% 0.56ns ± 0% ~ (p=0.066 n=18+19) Float32bits-12 0.56ns ± 0% 0.56ns ± 0% ~ (p=0.889 n=16+19) Float32frombits-12 0.56ns ± 0% 0.56ns ± 0% ~ (p=0.007 n=18+19) FMA-12 23.9ns ± 0% 24.0ns ± 0% +0.31% (p=0.000 n=16+17) [Geo mean] 9.86ns 9.77ns -0.87% (https://perf.golang.org/search?q=upload:20210415.5) For #40724. Change-Id: I44fbba2a17be930ec9daeb0a8222f55cd50555a0 Reviewed-on: https://go-review.googlesource.com/c/go/+/310331 Trust: Austin Clements <austin@google.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2021-04-15 15:48:19 +00:00
Ilya Tocar	c15984c6c6	math: remove unused variable useSSE41 was used inside asm implementation of floor to select between base and ss4 code path. We intrinsified floor and left asm functions as a backup for non-sse4 systems. This made variable unused, so remove it. Change-Id: Ia2633de7c7cb1ef1d5b15a2366b523e481b722d9 Reviewed-on: https://go-review.googlesource.com/97935 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-03-01 18:51:44 +00:00
Martin Möhrmann	69972aea74	internal/cpu: new package to detect cpu features Implements detection of x86 cpu features that are used in the go standard library. Changes all standard library packages to use the new cpu package instead of using runtime internal variables to check x86 cpu features. Updates: #15403 Change-Id: I2999a10cb4d9ec4863ffbed72f4e021a1dbc4bb9 Reviewed-on: https://go-review.googlesource.com/41476 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-05-10 17:02:21 +00:00
Ilya Tocar	37cfb2e07e	math: optimize ceil/floor functions on amd64 Use SSE 4.1 rounding instruction to perform rounding Results (haswell): name old time/op new time/op delta Floor-48 2.71ns ± 0% 1.87ns ± 1% -31.17% (p=0.000 n=16+19) Ceil-48 3.09ns ± 3% 2.16ns ± 0% -30.16% (p=0.000 n=19+12) Change-Id: If63715879eed6530b1eb4fc96132d827f8f43909 Reviewed-on: https://go-review.googlesource.com/14561 Reviewed-by: Klaus Post <klauspost@gmail.com> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2015-10-03 15:55:08 +00:00

4 Commits