mirror of https://github.com/golang/go.git
15 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
60b500dc6c |
math/big: remove bounds checks for shrVU_g inner loop
Make explicit a shrVU_g precondition. Replace i with i+1 throughout the loop. The resulting loop is functionally identical, but the compiler can do better BCE without the i-1 slice offset. Benchmarks results on amd64 with -tags=math_big_pure_go. name old time/op new time/op delta NonZeroShifts/1/shrVU-8 4.55ns ± 2% 4.45ns ± 3% -2.27% (p=0.000 n=28+30) NonZeroShifts/1/shlVU-8 4.07ns ± 1% 4.13ns ± 4% +1.55% (p=0.000 n=26+29) NonZeroShifts/2/shrVU-8 6.12ns ± 1% 5.55ns ± 1% -9.30% (p=0.000 n=28+28) NonZeroShifts/2/shlVU-8 5.65ns ± 3% 5.70ns ± 2% +0.92% (p=0.008 n=30+29) NonZeroShifts/3/shrVU-8 7.58ns ± 2% 6.79ns ± 2% -10.46% (p=0.000 n=28+28) NonZeroShifts/3/shlVU-8 6.62ns ± 2% 6.69ns ± 1% +1.07% (p=0.000 n=29+28) NonZeroShifts/4/shrVU-8 9.02ns ± 1% 7.79ns ± 2% -13.59% (p=0.000 n=27+30) NonZeroShifts/4/shlVU-8 7.74ns ± 1% 7.82ns ± 1% +0.92% (p=0.000 n=26+28) NonZeroShifts/5/shrVU-8 10.6ns ± 1% 8.9ns ± 3% -16.31% (p=0.000 n=25+29) NonZeroShifts/5/shlVU-8 8.59ns ± 1% 8.68ns ± 1% +1.13% (p=0.000 n=27+29) NonZeroShifts/10/shrVU-8 18.2ns ± 2% 14.4ns ± 1% -20.96% (p=0.000 n=27+28) NonZeroShifts/10/shlVU-8 14.1ns ± 1% 14.1ns ± 1% +0.46% (p=0.001 n=26+28) NonZeroShifts/100/shrVU-8 161ns ± 2% 118ns ± 1% -26.83% (p=0.000 n=29+30) NonZeroShifts/100/shlVU-8 119ns ± 2% 120ns ± 2% +0.92% (p=0.000 n=29+29) NonZeroShifts/1000/shrVU-8 1.54µs ± 1% 1.10µs ± 1% -28.63% (p=0.000 n=29+29) NonZeroShifts/1000/shlVU-8 1.10µs ± 1% 1.10µs ± 2% ~ (p=0.701 n=28+29) NonZeroShifts/10000/shrVU-8 15.3µs ± 2% 10.9µs ± 1% -28.68% (p=0.000 n=28+28) NonZeroShifts/10000/shlVU-8 10.9µs ± 2% 10.9µs ± 2% -0.57% (p=0.003 n=26+29) NonZeroShifts/100000/shrVU-8 154µs ± 1% 111µs ± 2% -28.04% (p=0.000 n=27+28) NonZeroShifts/100000/shlVU-8 113µs ± 2% 113µs ± 2% ~ (p=0.790 n=30+30) Change-Id: Ib6a621ee7c88b27f0f18121fb2cba3606c40c9b0 Reviewed-on: https://go-review.googlesource.com/c/go/+/297049 Trust: Josh Bleecher Snyder <josharian@gmail.com> Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org> |
|
|
|
d54a9a9c42 |
math/big: replace division with multiplication by reciprocal word
Division is much slower than multiplication. And the method of using multiplication by multiplying reciprocal and replacing division with it can increase the speed of divWVW algorithm by three times,and at the same time increase the speed of nats division. The benchmark test on arm64 is as follows: name old time/op new time/op delta DivWVW/1-4 13.1ns ± 4% 13.3ns ± 4% ~ (p=0.444 n=5+5) DivWVW/2-4 48.6ns ± 1% 51.2ns ± 2% +5.39% (p=0.008 n=5+5) DivWVW/3-4 82.0ns ± 1% 69.7ns ± 1% -15.03% (p=0.008 n=5+5) DivWVW/4-4 116ns ± 1% 71ns ± 2% -38.88% (p=0.008 n=5+5) DivWVW/5-4 152ns ± 1% 84ns ± 4% -44.70% (p=0.008 n=5+5) DivWVW/10-4 319ns ± 1% 155ns ± 4% -51.50% (p=0.008 n=5+5) DivWVW/100-4 3.44µs ± 3% 1.30µs ± 8% -62.30% (p=0.008 n=5+5) DivWVW/1000-4 33.8µs ± 0% 10.9µs ± 1% -67.74% (p=0.008 n=5+5) DivWVW/10000-4 343µs ± 4% 111µs ± 5% -67.63% (p=0.008 n=5+5) DivWVW/100000-4 3.35ms ± 1% 1.25ms ± 3% -62.79% (p=0.008 n=5+5) QuoRem-4 3.08µs ± 2% 2.21µs ± 4% -28.40% (p=0.008 n=5+5) ModSqrt225_Tonelli-4 444µs ± 2% 457µs ± 3% ~ (p=0.095 n=5+5) ModSqrt225_3Mod4-4 136µs ± 1% 138µs ± 3% ~ (p=0.151 n=5+5) ModSqrt231_Tonelli-4 473µs ± 3% 483µs ± 4% ~ (p=0.548 n=5+5) ModSqrt231_5Mod8-4 164µs ± 9% 169µs ±12% ~ (p=0.421 n=5+5) Sqrt-4 36.8µs ± 1% 28.6µs ± 0% -22.17% (p=0.016 n=5+4) Div/20/10-4 50.0ns ± 3% 51.3ns ± 6% ~ (p=0.238 n=5+5) Div/40/20-4 49.8ns ± 2% 51.3ns ± 6% ~ (p=0.222 n=5+5) Div/100/50-4 85.8ns ± 4% 86.5ns ± 5% ~ (p=0.246 n=5+5) Div/200/100-4 335ns ± 3% 296ns ± 2% -11.60% (p=0.008 n=5+5) Div/400/200-4 442ns ± 2% 359ns ± 5% -18.81% (p=0.008 n=5+5) Div/1000/500-4 858ns ± 3% 643ns ± 6% -25.06% (p=0.008 n=5+5) Div/2000/1000-4 1.70µs ± 3% 1.28µs ± 4% -24.80% (p=0.008 n=5+5) Div/20000/10000-4 45.0µs ± 5% 41.8µs ± 4% -7.17% (p=0.016 n=5+5) Div/200000/100000-4 1.51ms ± 7% 1.43ms ± 3% -5.42% (p=0.016 n=5+5) Div/2000000/1000000-4 57.6ms ± 4% 57.5ms ± 3% ~ (p=1.000 n=5+5) Div/20000000/10000000-4 2.08s ± 3% 2.04s ± 1% ~ (p=0.095 n=5+5) name old speed new speed delta DivWVW/1-4 4.87GB/s ± 4% 4.80GB/s ± 4% ~ (p=0.310 n=5+5) DivWVW/2-4 2.63GB/s ± 1% 2.50GB/s ± 2% -5.07% (p=0.008 n=5+5) DivWVW/3-4 2.34GB/s ± 1% 2.76GB/s ± 1% +17.70% (p=0.008 n=5+5) DivWVW/4-4 2.21GB/s ± 1% 3.61GB/s ± 2% +63.42% (p=0.008 n=5+5) DivWVW/5-4 2.10GB/s ± 2% 3.81GB/s ± 4% +80.89% (p=0.008 n=5+5) DivWVW/10-4 2.01GB/s ± 0% 4.13GB/s ± 4% +105.91% (p=0.008 n=5+5) DivWVW/100-4 1.86GB/s ± 2% 4.95GB/s ± 7% +165.63% (p=0.008 n=5+5) DivWVW/1000-4 1.89GB/s ± 0% 5.86GB/s ± 1% +209.96% (p=0.008 n=5+5) DivWVW/10000-4 1.87GB/s ± 4% 5.76GB/s ± 5% +208.96% (p=0.008 n=5+5) DivWVW/100000-4 1.91GB/s ± 1% 5.14GB/s ± 3% +168.85% (p=0.008 n=5+5) Change-Id: I049f1196562b20800e6ef8a6493fd147f93ad830 Reviewed-on: https://go-review.googlesource.com/c/go/+/250417 Trust: Giovanni Bajo <rasky@develer.com> Trust: Keith Randall <khr@golang.org> Run-TryBot: Giovanni Bajo <rasky@develer.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> |
|
|
|
964fe4b80f |
math/big: simplify shlVU_g and shrVU_g
Rewrote a few lines to be more idiomatic/less assembly-ish.
Benchmarked with `go test -bench Float -tags math_big_pure_go`:
name old time/op new time/op delta
FloatString/100-8 751ns ± 0% 746ns ± 1% -0.71% (p=0.000 n=10+10)
FloatString/1000-8 22.9µs ± 0% 22.9µs ± 0% ~ (p=0.271 n=10+10)
FloatString/10000-8 1.89ms ± 0% 1.89ms ± 0% ~ (p=0.481 n=10+10)
FloatString/100000-8 184ms ± 0% 184ms ± 0% ~ (p=0.094 n=9+9)
FloatAdd/10-8 56.4ns ± 1% 56.5ns ± 0% ~ (p=0.170 n=9+9)
FloatAdd/100-8 59.7ns ± 0% 59.3ns ± 0% -0.70% (p=0.000 n=8+9)
FloatAdd/1000-8 101ns ± 0% 99ns ± 0% -1.89% (p=0.000 n=8+8)
FloatAdd/10000-8 553ns ± 0% 536ns ± 0% -3.00% (p=0.000 n=9+10)
FloatAdd/100000-8 4.94µs ± 0% 4.74µs ± 0% -3.94% (p=0.000 n=9+10)
FloatSub/10-8 50.3ns ± 0% 50.5ns ± 0% +0.52% (p=0.000 n=8+8)
FloatSub/100-8 52.0ns ± 0% 52.2ns ± 1% +0.46% (p=0.012 n=8+10)
FloatSub/1000-8 77.9ns ± 0% 77.3ns ± 0% -0.80% (p=0.000 n=7+8)
FloatSub/10000-8 371ns ± 0% 362ns ± 0% -2.67% (p=0.000 n=10+10)
FloatSub/100000-8 3.20µs ± 0% 3.10µs ± 0% -3.16% (p=0.000 n=10+10)
ParseFloatSmallExp-8 7.84µs ± 0% 7.82µs ± 0% -0.17% (p=0.037 n=9+9)
ParseFloatLargeExp-8 29.3µs ± 1% 29.5µs ± 0% ~ (p=0.059 n=9+8)
FloatSqrt/64-8 516ns ± 0% 519ns ± 0% +0.54% (p=0.000 n=9+9)
FloatSqrt/128-8 1.07µs ± 0% 1.07µs ± 0% ~ (p=0.109 n=8+9)
FloatSqrt/256-8 1.23µs ± 0% 1.23µs ± 0% +0.50% (p=0.000 n=9+9)
FloatSqrt/1000-8 3.43µs ± 0% 3.44µs ± 0% +0.53% (p=0.000 n=9+8)
FloatSqrt/10000-8 40.9µs ± 0% 40.7µs ± 0% -0.39% (p=0.000 n=9+8)
FloatSqrt/100000-8 1.07ms ± 0% 1.07ms ± 0% -0.10% (p=0.017 n=10+9)
FloatSqrt/1000000-8 89.3ms ± 0% 89.2ms ± 0% -0.07% (p=0.015 n=9+8)
Change-Id: Ibf07c6142719d11bc7f329246957d87a9f3ba3d2
GitHub-Last-Rev:
|
|
|
|
fe24837c4d |
math/big: add fast path for pure Go addVW for large z
In the normal case, only a few words have to be updated when adding a word to a vector. When that happens, we can simply copy the rest of the words, which is much faster. However, the overhead of that makes it prohibitive for small vectors, so we check the size at the beginning. The implementation is a bit weird to allow addVW to continued to be inlined; see #30548. The AddVW benchmarks are surprising, but fully repeatable. The SubVW benchmarks are more or less as expected. I expect that removing the indirect function call will help both and make them a bit more normal. name old time/op new time/op delta AddVW/1-8 4.27ns ± 2% 3.81ns ± 3% -10.83% (p=0.000 n=89+90) AddVW/2-8 4.91ns ± 2% 4.34ns ± 1% -11.60% (p=0.000 n=83+90) AddVW/3-8 5.77ns ± 4% 5.76ns ± 2% ~ (p=0.365 n=91+87) AddVW/4-8 6.03ns ± 1% 6.03ns ± 1% ~ (p=0.392 n=80+76) AddVW/5-8 6.48ns ± 2% 6.63ns ± 1% +2.27% (p=0.000 n=76+74) AddVW/10-8 9.56ns ± 2% 9.56ns ± 1% -0.02% (p=0.002 n=69+76) AddVW/100-8 90.6ns ± 0% 18.1ns ± 4% -79.99% (p=0.000 n=72+94) AddVW/1000-8 865ns ± 0% 85ns ± 6% -90.14% (p=0.000 n=66+96) AddVW/10000-8 8.57µs ± 2% 1.82µs ± 3% -78.73% (p=0.000 n=99+94) AddVW/100000-8 84.4µs ± 2% 31.8µs ± 4% -62.29% (p=0.000 n=93+98) name old time/op new time/op delta SubVW/1-8 3.90ns ± 2% 4.13ns ± 4% +6.02% (p=0.000 n=92+95) SubVW/2-8 4.15ns ± 1% 5.20ns ± 1% +25.22% (p=0.000 n=83+85) SubVW/3-8 5.50ns ± 2% 6.22ns ± 6% +13.21% (p=0.000 n=91+97) SubVW/4-8 5.99ns ± 1% 6.63ns ± 1% +10.63% (p=0.000 n=79+61) SubVW/5-8 6.75ns ± 4% 6.88ns ± 2% +1.82% (p=0.000 n=98+73) SubVW/10-8 9.57ns ± 1% 9.56ns ± 1% -0.13% (p=0.000 n=77+64) SubVW/100-8 90.3ns ± 1% 18.1ns ± 2% -80.00% (p=0.000 n=75+94) SubVW/1000-8 860ns ± 4% 85ns ± 7% -90.14% (p=0.000 n=97+99) SubVW/10000-8 8.51µs ± 3% 1.77µs ± 6% -79.21% (p=0.000 n=100+97) SubVW/100000-8 84.4µs ± 3% 31.5µs ± 3% -62.66% (p=0.000 n=92+92) Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937 Reviewed-on: https://go-review.googlesource.com/c/go/+/164968 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org> |
|
|
|
4c227a091e |
math/big: remove bounds checks in pure Go implementations
These routines are quite sensitive to BCE.
This change eliminates bounds checks from loops.
It does so at the cost of a bit of safety:
malformed input will now return incorrect answers
instead of panicking.
This isn't as bad as it sounds: math/big has very good
test coverage, and the alternative implementations are in
assembly, which could do much worse things with malformed input.
If the compiler's BCE improves, so could these routines.
Notable BCE improvements for these routines would be:
* Allowing and propagating more cross-slice length hints.
Then hints like _ = y[:len(z)] would eliminate bounds checks for y[i].
* Propagating enough information so that we could do
n := len(x)
if len(z) < n {
n = len(z)
}
and then have i < n eliminate the same bounds checks as
i < len(x) && i < len(z) currently does.
* Providing some way to do BCE for unrolled loops.
Now that we have math/bits implementations,
it is possible to write things like ADC chains in
pure Go, if you can reasonably unroll loops.
Benchmarks below are for amd64, using -tags=math_big_pure_go.
name old time/op new time/op delta
AddVV/1-8 5.15ns ± 3% 4.65ns ± 4% -9.81% (p=0.000 n=93+86)
AddVV/2-8 6.40ns ± 2% 5.58ns ± 4% -12.78% (p=0.000 n=90+95)
AddVV/3-8 7.07ns ± 2% 6.66ns ± 2% -5.88% (p=0.000 n=87+83)
AddVV/4-8 7.94ns ± 5% 7.41ns ± 4% -6.65% (p=0.000 n=94+98)
AddVV/5-8 8.55ns ± 1% 8.80ns ± 0% +2.92% (p=0.000 n=87+92)
AddVV/10-8 12.7ns ± 1% 12.3ns ± 1% -3.12% (p=0.000 n=83+71)
AddVV/100-8 119ns ± 5% 117ns ± 4% -1.60% (p=0.000 n=93+90)
AddVV/1000-8 1.14µs ± 4% 1.14µs ± 5% ~ (p=0.812 n=95+91)
AddVV/10000-8 11.4µs ± 5% 11.3µs ± 5% ~ (p=0.503 n=97+96)
AddVV/100000-8 114µs ± 4% 113µs ± 5% -0.98% (p=0.002 n=97+90)
name old time/op new time/op delta
SubVV/1-8 5.23ns ± 5% 4.65ns ± 3% -11.18% (p=0.000 n=89+91)
SubVV/2-8 6.49ns ± 5% 5.58ns ± 3% -14.04% (p=0.000 n=92+94)
SubVV/3-8 7.10ns ± 3% 6.65ns ± 2% -6.28% (p=0.000 n=87+80)
SubVV/4-8 8.04ns ± 1% 7.44ns ± 5% -7.49% (p=0.000 n=83+98)
SubVV/5-8 8.55ns ± 2% 8.32ns ± 1% -2.75% (p=0.000 n=84+92)
SubVV/10-8 12.7ns ± 1% 12.3ns ± 1% -3.09% (p=0.000 n=80+75)
SubVV/100-8 119ns ± 0% 116ns ± 3% -1.83% (p=0.000 n=87+98)
SubVV/1000-8 1.13µs ± 5% 1.13µs ± 3% ~ (p=0.082 n=96+98)
SubVV/10000-8 11.2µs ± 1% 11.3µs ± 3% +0.76% (p=0.000 n=87+97)
SubVV/100000-8 112µs ± 2% 113µs ± 3% +0.55% (p=0.000 n=76+88)
name old time/op new time/op delta
AddVW/1-8 4.30ns ± 4% 3.96ns ± 6% -8.02% (p=0.000 n=89+97)
AddVW/2-8 5.15ns ± 2% 4.91ns ± 1% -4.56% (p=0.000 n=87+80)
AddVW/3-8 5.59ns ± 3% 5.75ns ± 2% +2.91% (p=0.000 n=91+88)
AddVW/4-8 6.20ns ± 1% 6.03ns ± 1% -2.71% (p=0.000 n=75+90)
AddVW/5-8 6.93ns ± 3% 6.49ns ± 2% -6.35% (p=0.000 n=100+82)
AddVW/10-8 10.0ns ± 7% 9.6ns ± 0% -4.02% (p=0.000 n=98+74)
AddVW/100-8 91.1ns ± 1% 90.6ns ± 1% -0.55% (p=0.000 n=84+80)
AddVW/1000-8 866ns ± 1% 856ns ± 4% -1.06% (p=0.000 n=69+96)
AddVW/10000-8 8.64µs ± 1% 8.53µs ± 4% -1.25% (p=0.000 n=67+99)
AddVW/100000-8 84.3µs ± 2% 85.4µs ± 4% +1.22% (p=0.000 n=89+99)
name old time/op new time/op delta
SubVW/1-8 4.28ns ± 2% 3.82ns ± 3% -10.63% (p=0.000 n=91+89)
SubVW/2-8 4.61ns ± 1% 4.48ns ± 3% -2.67% (p=0.000 n=94+96)
SubVW/3-8 5.54ns ± 1% 5.81ns ± 4% +4.87% (p=0.000 n=92+97)
SubVW/4-8 6.20ns ± 1% 6.08ns ± 2% -1.99% (p=0.000 n=71+88)
SubVW/5-8 6.91ns ± 3% 6.64ns ± 1% -3.90% (p=0.000 n=97+70)
SubVW/10-8 9.85ns ± 2% 9.62ns ± 0% -2.31% (p=0.000 n=82+62)
SubVW/100-8 91.1ns ± 1% 90.9ns ± 3% -0.14% (p=0.010 n=71+93)
SubVW/1000-8 859ns ± 3% 867ns ± 1% +0.98% (p=0.000 n=99+78)
SubVW/10000-8 8.54µs ± 5% 8.57µs ± 2% +0.38% (p=0.007 n=98+92)
SubVW/100000-8 84.5µs ± 3% 84.6µs ± 3% ~ (p=0.334 n=95+94)
name old time/op new time/op delta
AddMulVVW/1-8 5.43ns ± 3% 4.36ns ± 2% -19.67% (p=0.000 n=95+94)
AddMulVVW/2-8 6.56ns ± 4% 6.11ns ± 1% -6.90% (p=0.000 n=91+91)
AddMulVVW/3-8 8.00ns ± 1% 7.80ns ± 4% -2.52% (p=0.000 n=83+95)
AddMulVVW/4-8 9.81ns ± 2% 9.53ns ± 1% -2.86% (p=0.000 n=77+64)
AddMulVVW/5-8 11.4ns ± 3% 11.3ns ± 5% -0.89% (p=0.000 n=95+97)
AddMulVVW/10-8 18.9ns ± 5% 19.1ns ± 5% +0.89% (p=0.000 n=91+94)
AddMulVVW/100-8 165ns ± 5% 165ns ± 4% ~ (p=0.427 n=97+98)
AddMulVVW/1000-8 1.56µs ± 3% 1.56µs ± 4% ~ (p=0.167 n=98+96)
AddMulVVW/10000-8 15.7µs ± 5% 15.6µs ± 5% -0.31% (p=0.044 n=95+97)
AddMulVVW/100000-8 156µs ± 3% 157µs ± 8% ~ (p=0.373 n=72+99)
Change-Id: Ibc720785d5b95f6a797103b1363843205f4d56bf
Reviewed-on: https://go-review.googlesource.com/c/go/+/164966
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
|
|
d5edbcac98 |
math/big: rewrite pure Go implementations to use math/bits
While we're here, delete addWW_g and subWW_g, per the TODO. They are now obsolete. Benchmarks on amd64 with -tags=math_big_pure_go. name old time/op new time/op delta AddVV/1-8 5.24ns ± 2% 5.12ns ± 1% -2.11% (p=0.000 n=82+87) AddVV/2-8 6.44ns ± 1% 6.33ns ± 2% -1.82% (p=0.000 n=77+82) AddVV/3-8 7.89ns ± 8% 6.97ns ± 4% -11.71% (p=0.000 n=100+96) AddVV/4-8 8.60ns ± 0% 7.72ns ± 4% -10.24% (p=0.000 n=90+96) AddVV/5-8 10.3ns ± 4% 8.5ns ± 1% -17.02% (p=0.000 n=96+91) AddVV/10-8 16.2ns ± 5% 12.8ns ± 1% -21.11% (p=0.000 n=97+86) AddVV/100-8 148ns ± 1% 117ns ± 5% -21.07% (p=0.000 n=66+98) AddVV/1000-8 1.41µs ± 4% 1.13µs ± 3% -19.90% (p=0.000 n=97+97) AddVV/10000-8 14.2µs ± 5% 11.2µs ± 1% -20.82% (p=0.000 n=99+84) AddVV/100000-8 142µs ± 4% 113µs ± 4% -20.40% (p=0.000 n=91+92) SubVV/1-8 5.29ns ± 1% 5.11ns ± 0% -3.30% (p=0.000 n=87+88) SubVV/2-8 6.36ns ± 4% 6.33ns ± 2% -0.56% (p=0.002 n=98+73) SubVV/3-8 7.58ns ± 5% 6.98ns ± 4% -8.01% (p=0.000 n=97+91) SubVV/4-8 8.61ns ± 3% 7.98ns ± 2% -7.31% (p=0.000 n=95+83) SubVV/5-8 10.6ns ± 2% 8.5ns ± 1% -19.56% (p=0.000 n=79+89) SubVV/10-8 16.3ns ± 4% 12.7ns ± 1% -21.97% (p=0.000 n=98+82) SubVV/100-8 124ns ± 1% 118ns ± 1% -4.83% (p=0.000 n=85+81) SubVV/1000-8 1.14µs ± 5% 1.12µs ± 2% -1.17% (p=0.000 n=97+81) SubVV/10000-8 11.6µs ±10% 11.2µs ± 1% -3.39% (p=0.000 n=100+84) SubVV/100000-8 114µs ± 6% 114µs ± 5% ~ (p=0.396 n=83+94) AddVW/1-8 4.04ns ± 4% 4.34ns ± 4% +7.57% (p=0.000 n=96+98) AddVW/2-8 4.34ns ± 5% 4.40ns ± 5% +1.40% (p=0.000 n=99+98) AddVW/3-8 5.43ns ± 0% 5.54ns ± 2% +1.97% (p=0.000 n=85+94) AddVW/4-8 6.23ns ± 1% 6.18ns ± 2% -0.66% (p=0.000 n=77+78) AddVW/5-8 6.78ns ± 2% 6.90ns ± 4% +1.77% (p=0.000 n=80+99) AddVW/10-8 10.5ns ± 4% 9.9ns ± 1% -5.77% (p=0.000 n=97+69) AddVW/100-8 114ns ± 3% 91ns ± 0% -20.38% (p=0.000 n=98+77) AddVW/1000-8 1.12µs ± 1% 0.87µs ± 1% -22.80% (p=0.000 n=82+68) AddVW/10000-8 11.2µs ± 2% 8.5µs ± 5% -23.85% (p=0.000 n=85+100) AddVW/100000-8 112µs ± 2% 85µs ± 5% -24.22% (p=0.000 n=71+96) SubVW/1-8 4.09ns ± 2% 4.18ns ± 4% +2.32% (p=0.000 n=78+96) SubVW/2-8 4.59ns ± 5% 4.52ns ± 7% -1.54% (p=0.000 n=98+94) SubVW/3-8 5.41ns ±10% 5.55ns ± 1% +2.48% (p=0.000 n=100+89) SubVW/4-8 6.51ns ± 2% 6.19ns ± 0% -4.85% (p=0.000 n=97+81) SubVW/5-8 7.25ns ± 3% 6.90ns ± 4% -4.93% (p=0.000 n=97+96) SubVW/10-8 10.6ns ± 4% 9.8ns ± 2% -7.32% (p=0.000 n=95+96) SubVW/100-8 90.4ns ± 0% 90.8ns ± 0% +0.43% (p=0.000 n=83+78) SubVW/1000-8 853ns ± 4% 857ns ± 2% +0.42% (p=0.000 n=100+98) SubVW/10000-8 8.52µs ± 4% 8.53µs ± 2% ~ (p=0.061 n=99+97) SubVW/100000-8 84.8µs ± 5% 84.2µs ± 2% -0.78% (p=0.000 n=99+93) AddMulVVW/1-8 8.73ns ± 0% 5.33ns ± 3% -38.91% (p=0.000 n=91+96) AddMulVVW/2-8 14.8ns ± 3% 6.5ns ± 2% -56.33% (p=0.000 n=100+79) AddMulVVW/3-8 18.6ns ± 2% 7.8ns ± 5% -57.84% (p=0.000 n=89+96) AddMulVVW/4-8 24.0ns ± 2% 9.8ns ± 0% -59.09% (p=0.000 n=95+67) AddMulVVW/5-8 29.0ns ± 2% 11.5ns ± 5% -60.44% (p=0.000 n=90+97) AddMulVVW/10-8 54.1ns ± 0% 18.8ns ± 1% -65.37% (p=0.000 n=82+84) AddMulVVW/100-8 508ns ± 2% 165ns ± 4% -67.62% (p=0.000 n=72+98) AddMulVVW/1000-8 4.96µs ± 3% 1.55µs ± 1% -68.86% (p=0.000 n=99+91) AddMulVVW/10000-8 50.0µs ± 4% 15.5µs ± 4% -68.95% (p=0.000 n=97+97) AddMulVVW/100000-8 491µs ± 1% 156µs ± 8% -68.22% (p=0.000 n=79+95) Change-Id: I4c6ae0b4065f371aea8103f6a85d9e9274bf01d0 Reviewed-on: https://go-review.googlesource.com/c/go/+/164965 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org> |
|
|
|
87cc56718a |
math/big: optimize shlVU_g and shrVU_g
Special case shifts by zero. Provide hints to the compiler that shifts are bounded. There are no existing benchmarks for shifts, but the Float implementation uses shifts, so we can use those. Benchmarks on amd64 with -tags=math_big_pure_go. name old time/op new time/op delta FloatString/100-8 869ns ± 3% 872ns ± 4% +0.40% (p=0.001 n=94+83) FloatString/1000-8 26.5µs ± 1% 26.4µs ± 1% -0.46% (p=0.000 n=87+96) FloatString/10000-8 2.18ms ± 2% 2.18ms ± 2% ~ (p=0.687 n=90+89) FloatString/100000-8 200ms ± 7% 197ms ± 5% -1.47% (p=0.000 n=100+90) FloatAdd/10-8 65.9ns ± 4% 64.0ns ± 4% -2.94% (p=0.000 n=92+93) FloatAdd/100-8 71.3ns ± 4% 67.4ns ± 4% -5.51% (p=0.000 n=96+93) FloatAdd/1000-8 128ns ± 1% 121ns ± 0% -5.69% (p=0.000 n=91+80) FloatAdd/10000-8 718ns ± 4% 626ns ± 4% -12.83% (p=0.000 n=99+99) FloatAdd/100000-8 6.43µs ± 3% 5.50µs ± 1% -14.50% (p=0.000 n=98+83) FloatSub/10-8 57.7ns ± 2% 57.0ns ± 4% -1.20% (p=0.000 n=89+96) FloatSub/100-8 59.9ns ± 3% 58.7ns ± 4% -2.10% (p=0.000 n=100+98) FloatSub/1000-8 94.5ns ± 1% 88.6ns ± 0% -6.16% (p=0.000 n=74+70) FloatSub/10000-8 456ns ± 1% 416ns ± 5% -8.83% (p=0.000 n=87+95) FloatSub/100000-8 4.00µs ± 1% 3.57µs ± 1% -10.87% (p=0.000 n=68+85) FloatSqrt/64-8 585ns ± 1% 579ns ± 1% -0.99% (p=0.000 n=92+90) FloatSqrt/128-8 1.26µs ± 1% 1.23µs ± 2% -2.42% (p=0.000 n=91+81) FloatSqrt/256-8 1.45µs ± 3% 1.40µs ± 1% -3.61% (p=0.000 n=96+90) FloatSqrt/1000-8 4.03µs ± 1% 3.91µs ± 1% -3.05% (p=0.000 n=90+93) FloatSqrt/10000-8 48.0µs ± 0% 47.3µs ± 1% -1.55% (p=0.000 n=90+90) FloatSqrt/100000-8 1.23ms ± 3% 1.22ms ± 4% -1.00% (p=0.000 n=99+99) FloatSqrt/1000000-8 96.7ms ± 4% 98.0ms ±10% ~ (p=0.322 n=89+99) Change-Id: I0f941c05b7c324256d7f0674559b6ba906e92ba8 Reviewed-on: https://go-review.googlesource.com/c/go/+/164967 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org> |
|
|
|
f28191340e |
math/big: fix a formula used as documentation
The function documentation was wrong, it was using a wrong parameter. This change replaces it with the right parameter. The wrong formula was: q = (u1<<_W + u0 - r)/y The function has got a parameter "v" (of type Word), not a parameter "y". So, the right formula is: q = (u1<<_W + u0 - r)/v Fixes #28444 Change-Id: I82e57ba014735a9fdb6262874ddf498754d30d33 Reviewed-on: https://go-review.googlesource.com/c/145280 Reviewed-by: Robert Griesemer <gri@golang.org> |
|
|
|
70ea0ec30f |
math/big: replace local versions of bitLen, nlz with math/bits versions
Verified that BenchmarkBitLen time went down from 2.25 ns/op to 0.65 ns/op an a 2.3 GHz Intel Core i7, before removing that benchmark (now covered by math/bits benchmarks). Change-Id: I3890bb7d1889e95b9a94bd68f0bdf06f1885adeb Reviewed-on: https://go-review.googlesource.com/38464 Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> |
|
|
|
322fff8ac8 |
math/big: use math/bits where appropriate
This change adds math/bits as a new dependency of math/big. - use bits.LeadingZeroes instead of local implementation (they are identical, so there's no performance loss here) - leave other functionality local (ntz, bitLen) since there's faster implementations in math/big at the moment Change-Id: I1218aa8a1df0cc9783583b090a4bb5a8a145c4a2 Reviewed-on: https://go-review.googlesource.com/37141 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> |
|
|
|
174058038c |
math/big: define Word as uint instead of uintptr
For compatibility with math/bits uint operations. When math/big was written originally, the Go compiler used 32bit int/uint values even on a 64bit machine. uintptr was the type that represented the machine register size. Now, the int/uint types are sized to the native machine register size, so they are the natural machine Word type. On most machines, the size of int/uint correspond to the size of uintptr. On platforms where uint and uintptr have different sizes, this change may lead to performance differences (e.g., amd64p32). Change-Id: Ief249c160b707b6441848f20041e32e9e9d8d8ca Reviewed-on: https://go-review.googlesource.com/37372 Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> |
|
|
|
635cd91eb4 |
math/big: more cleanups (msbxx, nlzxx functions)
Change-Id: Ibace718452b6dc029c5af5240117f5fc794c38cf Reviewed-on: https://go-review.googlesource.com/10388 Reviewed-by: Alan Donovan <adonovan@google.com> |
|
|
|
91f2db3c57 |
math/big: test that subVW and addVW work with arbitrary y
Fixes #10525. Change-Id: I92dc87f5d6db396d8dde2220fc37b7093b772d81 Reviewed-on: https://go-review.googlesource.com/9210 Reviewed-by: Robert Griesemer <gri@golang.org> |
|
|
|
067acd51b0 |
math/big: faster "pure Go" addition/subtraction for long vectors
(platforms w/o corresponding assembly kernels) For short vector adds there's some erradic slow-down, but overall these routines have become significantly faster. This only matters for platforms w/o native (assembly) versions of these kernels, so we are not concerned about the minor slow-down for short vectors. This code was already reviewed under Mercurial (golang.org/cl/172810043) but wasn't submitted before the switch to git. Benchmarks run on 2.3GHz Intel Core i7, running OS X 10.9.5, with the respective AddVV and AddVW assembly routines disabled. benchmark old ns/op new ns/op delta BenchmarkAddVV_1 6.59 7.09 +7.59% BenchmarkAddVV_2 10.3 10.1 -1.94% BenchmarkAddVV_3 10.9 12.6 +15.60% BenchmarkAddVV_4 13.9 15.6 +12.23% BenchmarkAddVV_5 16.8 17.3 +2.98% BenchmarkAddVV_1e1 29.5 29.9 +1.36% BenchmarkAddVV_1e2 246 232 -5.69% BenchmarkAddVV_1e3 2374 2185 -7.96% BenchmarkAddVV_1e4 58942 22292 -62.18% BenchmarkAddVV_1e5 668622 225279 -66.31% BenchmarkAddVW_1 6.81 5.58 -18.06% BenchmarkAddVW_2 7.69 6.86 -10.79% BenchmarkAddVW_3 9.56 8.32 -12.97% BenchmarkAddVW_4 12.1 9.53 -21.24% BenchmarkAddVW_5 13.2 10.9 -17.42% BenchmarkAddVW_1e1 23.4 18.0 -23.08% BenchmarkAddVW_1e2 175 141 -19.43% BenchmarkAddVW_1e3 1568 1266 -19.26% BenchmarkAddVW_1e4 15425 12596 -18.34% BenchmarkAddVW_1e5 156737 133539 -14.80% BenchmarkFibo 381678466 132958666 -65.16% benchmark old MB/s new MB/s speedup BenchmarkAddVV_1 9715.25 9028.30 0.93x BenchmarkAddVV_2 12461.72 12622.60 1.01x BenchmarkAddVV_3 17549.64 15243.82 0.87x BenchmarkAddVV_4 18392.54 16398.29 0.89x BenchmarkAddVV_5 18995.23 18496.57 0.97x BenchmarkAddVV_1e1 21708.98 21438.28 0.99x BenchmarkAddVV_1e2 25956.53 27506.88 1.06x BenchmarkAddVV_1e3 26947.93 29286.66 1.09x BenchmarkAddVV_1e4 10857.96 28709.46 2.64x BenchmarkAddVV_1e5 9571.91 28409.21 2.97x BenchmarkAddVW_1 1175.28 1433.98 1.22x BenchmarkAddVW_2 2080.01 2332.54 1.12x BenchmarkAddVW_3 2509.28 2883.97 1.15x BenchmarkAddVW_4 2646.09 3356.83 1.27x BenchmarkAddVW_5 3020.69 3671.07 1.22x BenchmarkAddVW_1e1 3425.76 4441.40 1.30x BenchmarkAddVW_1e2 4553.17 5642.96 1.24x BenchmarkAddVW_1e3 5100.14 6318.72 1.24x BenchmarkAddVW_1e4 5186.15 6350.96 1.22x BenchmarkAddVW_1e5 5104.07 5990.74 1.17x Change-Id: I7a62023b1105248a0e85e5b9819d3fd4266123d4 Reviewed-on: https://go-review.googlesource.com/2480 Reviewed-by: Russ Cox <rsc@golang.org> Reviewed-by: Alan Donovan <adonovan@google.com> |
|
|
|
c007ce824d |
build: move package sources from src/pkg to src
Preparation was in CL 134570043. This CL contains only the effect of 'hg mv src/pkg/* src'. For more about the move, see golang.org/s/go14nopkg. |