15 Commits

Author SHA1 Message Date
Josh Bleecher Snyder 60b500dc6c math/big: remove bounds checks for shrVU_g inner loop
Make explicit a shrVU_g precondition.
Replace i with i+1 throughout the loop.
The resulting loop is functionally identical,
but the compiler can do better BCE without the i-1 slice offset.
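
A minimal sketch of the kind of loop described above, under one reading of
the message: the length precondition is checked up front and the loop indexes
with i and i+1. It is not the actual math/big code; uint stands in for Word
and the name shrVU is used only for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shrVU sets z = x >> s across words and returns the bits shifted out
// of x[0], left-aligned. Sketch only: uint stands in for math/big's Word.
func shrVU(z, x []uint, s uint) (c uint) {
	if s == 0 {
		copy(z, x)
		return 0
	}
	if len(z) == 0 {
		return 0
	}
	if len(x) != len(z) {
		// Precondition made explicit so the lengths are known to be equal.
		panic("len(x) != len(z)")
	}
	s &= bits.UintSize - 1 // shift amounts stay below the word size
	t := (bits.UintSize - s) & (bits.UintSize - 1)
	c = x[0] << t
	// Index with i and i+1; the last word is handled after the loop.
	for i := 0; i < len(z)-1; i++ {
		z[i] = x[i]>>s | x[i+1]<<t
	}
	z[len(z)-1] = x[len(z)-1] >> s
	return c
}

func main() {
	x := []uint{8, 1}
	z := make([]uint, len(x))
	c := shrVU(z, x, 1)
	fmt.Println(z, c)
}
```

Whether a given compiler version actually drops the bounds checks for this
loop shape has to be measured; the sketch only shows the indexing.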

Benchmark results on amd64 with -tags=math_big_pure_go.

name                          old time/op  new time/op  delta
NonZeroShifts/1/shrVU-8       4.55ns ± 2%  4.45ns ± 3%   -2.27%  (p=0.000 n=28+30)
NonZeroShifts/1/shlVU-8       4.07ns ± 1%  4.13ns ± 4%   +1.55%  (p=0.000 n=26+29)
NonZeroShifts/2/shrVU-8       6.12ns ± 1%  5.55ns ± 1%   -9.30%  (p=0.000 n=28+28)
NonZeroShifts/2/shlVU-8       5.65ns ± 3%  5.70ns ± 2%   +0.92%  (p=0.008 n=30+29)
NonZeroShifts/3/shrVU-8       7.58ns ± 2%  6.79ns ± 2%  -10.46%  (p=0.000 n=28+28)
NonZeroShifts/3/shlVU-8       6.62ns ± 2%  6.69ns ± 1%   +1.07%  (p=0.000 n=29+28)
NonZeroShifts/4/shrVU-8       9.02ns ± 1%  7.79ns ± 2%  -13.59%  (p=0.000 n=27+30)
NonZeroShifts/4/shlVU-8       7.74ns ± 1%  7.82ns ± 1%   +0.92%  (p=0.000 n=26+28)
NonZeroShifts/5/shrVU-8       10.6ns ± 1%   8.9ns ± 3%  -16.31%  (p=0.000 n=25+29)
NonZeroShifts/5/shlVU-8       8.59ns ± 1%  8.68ns ± 1%   +1.13%  (p=0.000 n=27+29)
NonZeroShifts/10/shrVU-8      18.2ns ± 2%  14.4ns ± 1%  -20.96%  (p=0.000 n=27+28)
NonZeroShifts/10/shlVU-8      14.1ns ± 1%  14.1ns ± 1%   +0.46%  (p=0.001 n=26+28)
NonZeroShifts/100/shrVU-8      161ns ± 2%   118ns ± 1%  -26.83%  (p=0.000 n=29+30)
NonZeroShifts/100/shlVU-8      119ns ± 2%   120ns ± 2%   +0.92%  (p=0.000 n=29+29)
NonZeroShifts/1000/shrVU-8    1.54µs ± 1%  1.10µs ± 1%  -28.63%  (p=0.000 n=29+29)
NonZeroShifts/1000/shlVU-8    1.10µs ± 1%  1.10µs ± 2%     ~     (p=0.701 n=28+29)
NonZeroShifts/10000/shrVU-8   15.3µs ± 2%  10.9µs ± 1%  -28.68%  (p=0.000 n=28+28)
NonZeroShifts/10000/shlVU-8   10.9µs ± 2%  10.9µs ± 2%   -0.57%  (p=0.003 n=26+29)
NonZeroShifts/100000/shrVU-8   154µs ± 1%   111µs ± 2%  -28.04%  (p=0.000 n=27+28)
NonZeroShifts/100000/shlVU-8   113µs ± 2%   113µs ± 2%     ~     (p=0.790 n=30+30)

Change-Id: Ib6a621ee7c88b27f0f18121fb2cba3606c40c9b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/297049
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2021-03-05 06:15:22 +00:00
SparrowLii d54a9a9c42 math/big: replace division with multiplication by reciprocal word
Division is much slower than multiplication. Replacing the division
with a multiplication by a precomputed reciprocal word can make the
divWVW algorithm roughly three times faster, and it also speeds up
nat division.
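
The idea can be sketched as follows for 64-bit words. This is an
illustration rather than the code from this change: the names reciprocal
and divByReciprocal are hypothetical, and the sketch assumes the divisor d
is already normalized (top bit set) and that x1 < d.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reciprocal returns floor((2^128-1)/d) - 2^64 for a normalized d
// (top bit set). Hypothetical helper; computed once per divisor.
func reciprocal(d uint64) uint64 {
	rec, _ := bits.Div64(^d, ^uint64(0), d)
	return rec
}

// divByReciprocal returns the quotient and remainder of (x1<<64 + x0) / d
// using multiplications by the precomputed reciprocal m.
// Assumes d has its top bit set and x1 < d.
func divByReciprocal(x1, x0, d, m uint64) (q, r uint64) {
	// Quotient estimate: high word of x1*m + (x1<<64 + x0).
	// The estimate can be too small, which is corrected below.
	t1, t0 := bits.Mul64(m, x1)
	_, c := bits.Add64(t0, x0, 0)
	q, _ = bits.Add64(t1, x1, c)

	// Two-word remainder (r1, r0) = x - d*q.
	dq1, dq0 := bits.Mul64(d, q)
	r0, b := bits.Sub64(x0, dq0, 0)
	r1, _ := bits.Sub64(x1, dq1, b)

	// Correct the estimate upward while the remainder is still too large.
	if r1 != 0 {
		q++
		r0 -= d
	}
	if r0 >= d {
		q++
		r0 -= d
	}
	return q, r0
}

func main() {
	d := uint64(1)<<63 | 12345 // normalized: top bit set
	m := reciprocal(d)
	for _, x := range [][2]uint64{{0, 7}, {1, 0}, {d - 1, ^uint64(0)}} {
		q, r := divByReciprocal(x[0], x[1], d, m)
		wq, wr := bits.Div64(x[0], x[1], d)
		fmt.Println(q == wq && r == wr)
	}
}
```

The reciprocal is computed with one real division per divisor, so the
saving comes from reusing it across every word of a long dividend.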

The benchmark test on arm64 is as follows:
name                     old time/op    new time/op    delta
DivWVW/1-4                 13.1ns ± 4%    13.3ns ± 4%      ~     (p=0.444 n=5+5)
DivWVW/2-4                 48.6ns ± 1%    51.2ns ± 2%    +5.39%  (p=0.008 n=5+5)
DivWVW/3-4                 82.0ns ± 1%    69.7ns ± 1%   -15.03%  (p=0.008 n=5+5)
DivWVW/4-4                  116ns ± 1%      71ns ± 2%   -38.88%  (p=0.008 n=5+5)
DivWVW/5-4                  152ns ± 1%      84ns ± 4%   -44.70%  (p=0.008 n=5+5)
DivWVW/10-4                 319ns ± 1%     155ns ± 4%   -51.50%  (p=0.008 n=5+5)
DivWVW/100-4               3.44µs ± 3%    1.30µs ± 8%   -62.30%  (p=0.008 n=5+5)
DivWVW/1000-4              33.8µs ± 0%    10.9µs ± 1%   -67.74%  (p=0.008 n=5+5)
DivWVW/10000-4              343µs ± 4%     111µs ± 5%   -67.63%  (p=0.008 n=5+5)
DivWVW/100000-4            3.35ms ± 1%    1.25ms ± 3%   -62.79%  (p=0.008 n=5+5)
QuoRem-4                   3.08µs ± 2%    2.21µs ± 4%   -28.40%  (p=0.008 n=5+5)
ModSqrt225_Tonelli-4        444µs ± 2%     457µs ± 3%      ~     (p=0.095 n=5+5)
ModSqrt225_3Mod4-4          136µs ± 1%     138µs ± 3%      ~     (p=0.151 n=5+5)
ModSqrt231_Tonelli-4        473µs ± 3%     483µs ± 4%      ~     (p=0.548 n=5+5)
ModSqrt231_5Mod8-4          164µs ± 9%     169µs ±12%      ~     (p=0.421 n=5+5)
Sqrt-4                     36.8µs ± 1%    28.6µs ± 0%   -22.17%  (p=0.016 n=5+4)
Div/20/10-4                50.0ns ± 3%    51.3ns ± 6%      ~     (p=0.238 n=5+5)
Div/40/20-4                49.8ns ± 2%    51.3ns ± 6%      ~     (p=0.222 n=5+5)
Div/100/50-4               85.8ns ± 4%    86.5ns ± 5%      ~     (p=0.246 n=5+5)
Div/200/100-4               335ns ± 3%     296ns ± 2%   -11.60%  (p=0.008 n=5+5)
Div/400/200-4               442ns ± 2%     359ns ± 5%   -18.81%  (p=0.008 n=5+5)
Div/1000/500-4              858ns ± 3%     643ns ± 6%   -25.06%  (p=0.008 n=5+5)
Div/2000/1000-4            1.70µs ± 3%    1.28µs ± 4%   -24.80%  (p=0.008 n=5+5)
Div/20000/10000-4          45.0µs ± 5%    41.8µs ± 4%    -7.17%  (p=0.016 n=5+5)
Div/200000/100000-4        1.51ms ± 7%    1.43ms ± 3%    -5.42%  (p=0.016 n=5+5)
Div/2000000/1000000-4      57.6ms ± 4%    57.5ms ± 3%      ~     (p=1.000 n=5+5)
Div/20000000/10000000-4     2.08s ± 3%     2.04s ± 1%      ~     (p=0.095 n=5+5)

name                     old speed      new speed      delta
DivWVW/1-4               4.87GB/s ± 4%  4.80GB/s ± 4%      ~     (p=0.310 n=5+5)
DivWVW/2-4               2.63GB/s ± 1%  2.50GB/s ± 2%    -5.07%  (p=0.008 n=5+5)
DivWVW/3-4               2.34GB/s ± 1%  2.76GB/s ± 1%   +17.70%  (p=0.008 n=5+5)
DivWVW/4-4               2.21GB/s ± 1%  3.61GB/s ± 2%   +63.42%  (p=0.008 n=5+5)
DivWVW/5-4               2.10GB/s ± 2%  3.81GB/s ± 4%   +80.89%  (p=0.008 n=5+5)
DivWVW/10-4              2.01GB/s ± 0%  4.13GB/s ± 4%  +105.91%  (p=0.008 n=5+5)
DivWVW/100-4             1.86GB/s ± 2%  4.95GB/s ± 7%  +165.63%  (p=0.008 n=5+5)
DivWVW/1000-4            1.89GB/s ± 0%  5.86GB/s ± 1%  +209.96%  (p=0.008 n=5+5)
DivWVW/10000-4           1.87GB/s ± 4%  5.76GB/s ± 5%  +208.96%  (p=0.008 n=5+5)
DivWVW/100000-4          1.91GB/s ± 1%  5.14GB/s ± 3%  +168.85%  (p=0.008 n=5+5)

Change-Id: I049f1196562b20800e6ef8a6493fd147f93ad830
Reviewed-on: https://go-review.googlesource.com/c/go/+/250417
Trust: Giovanni Bajo <rasky@develer.com>
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Giovanni Bajo <rasky@develer.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2020-09-23 21:55:55 +00:00
Neven Sajko 964fe4b80f math/big: simplify shlVU_g and shrVU_g
Rewrote a few lines to be more idiomatic/less assembly-ish.

Benchmarked with `go test -bench Float -tags math_big_pure_go`:

name                  old time/op    new time/op    delta
FloatString/100-8        751ns ± 0%     746ns ± 1%  -0.71%  (p=0.000 n=10+10)
FloatString/1000-8      22.9µs ± 0%    22.9µs ± 0%    ~     (p=0.271 n=10+10)
FloatString/10000-8     1.89ms ± 0%    1.89ms ± 0%    ~     (p=0.481 n=10+10)
FloatString/100000-8     184ms ± 0%     184ms ± 0%    ~     (p=0.094 n=9+9)
FloatAdd/10-8           56.4ns ± 1%    56.5ns ± 0%    ~     (p=0.170 n=9+9)
FloatAdd/100-8          59.7ns ± 0%    59.3ns ± 0%  -0.70%  (p=0.000 n=8+9)
FloatAdd/1000-8          101ns ± 0%      99ns ± 0%  -1.89%  (p=0.000 n=8+8)
FloatAdd/10000-8         553ns ± 0%     536ns ± 0%  -3.00%  (p=0.000 n=9+10)
FloatAdd/100000-8       4.94µs ± 0%    4.74µs ± 0%  -3.94%  (p=0.000 n=9+10)
FloatSub/10-8           50.3ns ± 0%    50.5ns ± 0%  +0.52%  (p=0.000 n=8+8)
FloatSub/100-8          52.0ns ± 0%    52.2ns ± 1%  +0.46%  (p=0.012 n=8+10)
FloatSub/1000-8         77.9ns ± 0%    77.3ns ± 0%  -0.80%  (p=0.000 n=7+8)
FloatSub/10000-8         371ns ± 0%     362ns ± 0%  -2.67%  (p=0.000 n=10+10)
FloatSub/100000-8       3.20µs ± 0%    3.10µs ± 0%  -3.16%  (p=0.000 n=10+10)
ParseFloatSmallExp-8    7.84µs ± 0%    7.82µs ± 0%  -0.17%  (p=0.037 n=9+9)
ParseFloatLargeExp-8    29.3µs ± 1%    29.5µs ± 0%    ~     (p=0.059 n=9+8)
FloatSqrt/64-8           516ns ± 0%     519ns ± 0%  +0.54%  (p=0.000 n=9+9)
FloatSqrt/128-8         1.07µs ± 0%    1.07µs ± 0%    ~     (p=0.109 n=8+9)
FloatSqrt/256-8         1.23µs ± 0%    1.23µs ± 0%  +0.50%  (p=0.000 n=9+9)
FloatSqrt/1000-8        3.43µs ± 0%    3.44µs ± 0%  +0.53%  (p=0.000 n=9+8)
FloatSqrt/10000-8       40.9µs ± 0%    40.7µs ± 0%  -0.39%  (p=0.000 n=9+8)
FloatSqrt/100000-8      1.07ms ± 0%    1.07ms ± 0%  -0.10%  (p=0.017 n=10+9)
FloatSqrt/1000000-8     89.3ms ± 0%    89.2ms ± 0%  -0.07%  (p=0.015 n=9+8)

Change-Id: Ibf07c6142719d11bc7f329246957d87a9f3ba3d2
GitHub-Last-Rev: 870a041ab7
GitHub-Pull-Request: golang/go#31220
Reviewed-on: https://go-review.googlesource.com/c/go/+/170449
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2019-04-04 00:26:24 +00:00
Josh Bleecher Snyder fe24837c4d math/big: add fast path for pure Go addVW for large z
In the normal case, only a few words have to be updated when adding a word to a vector.
When that happens, we can simply copy the rest of the words, which is much faster.
However, the overhead of that makes it prohibitive for small vectors,
so we check the size at the beginning.
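
A rough sketch of the fast path described above, not the code from this
change: uint stands in for Word, and the names addVW, addVWLarge and the
cutoff largeThreshold are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// largeThreshold is a hypothetical cutoff; a real value would be tuned
// with benchmarks.
const largeThreshold = 32

// addVW sets z = x + y for a multi-word x and a single word y and
// returns the final carry.
func addVW(z, x []uint, y uint) (c uint) {
	if len(z) > largeThreshold {
		return addVWLarge(z, x, y)
	}
	c = y
	for i := 0; i < len(z) && i < len(x); i++ {
		z[i], c = bits.Add(x[i], c, 0)
	}
	return c
}

// addVWLarge stops doing arithmetic once the carry is zero and copies
// the remaining words in bulk.
func addVWLarge(z, x []uint, y uint) (c uint) {
	c = y
	for i := 0; i < len(z) && i < len(x); i++ {
		if c == 0 {
			// No carry left: the remaining words are unchanged.
			copy(z[i:], x[i:])
			return 0
		}
		z[i], c = bits.Add(x[i], c, 0)
	}
	return c
}

func main() {
	x := []uint{^uint(0), ^uint(0), 5, 7}
	z := make([]uint, len(x))
	c := addVW(z, x, 1)
	fmt.Println(z, c)
}
```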

The implementation is a bit weird to allow addVW to continue to be inlined; see #30548.

The AddVW benchmarks are surprising, but fully repeatable.
The SubVW benchmarks are more or less as expected.
I expect that removing the indirect function call will
help both and make them a bit more normal.

name            old time/op    new time/op     delta
AddVW/1-8         4.27ns ± 2%     3.81ns ± 3%   -10.83%  (p=0.000 n=89+90)
AddVW/2-8         4.91ns ± 2%     4.34ns ± 1%   -11.60%  (p=0.000 n=83+90)
AddVW/3-8         5.77ns ± 4%     5.76ns ± 2%      ~     (p=0.365 n=91+87)
AddVW/4-8         6.03ns ± 1%     6.03ns ± 1%      ~     (p=0.392 n=80+76)
AddVW/5-8         6.48ns ± 2%     6.63ns ± 1%    +2.27%  (p=0.000 n=76+74)
AddVW/10-8        9.56ns ± 2%     9.56ns ± 1%    -0.02%  (p=0.002 n=69+76)
AddVW/100-8       90.6ns ± 0%     18.1ns ± 4%   -79.99%  (p=0.000 n=72+94)
AddVW/1000-8       865ns ± 0%       85ns ± 6%   -90.14%  (p=0.000 n=66+96)
AddVW/10000-8     8.57µs ± 2%     1.82µs ± 3%   -78.73%  (p=0.000 n=99+94)
AddVW/100000-8    84.4µs ± 2%     31.8µs ± 4%   -62.29%  (p=0.000 n=93+98)

name            old time/op    new time/op     delta
SubVW/1-8         3.90ns ± 2%     4.13ns ± 4%    +6.02%  (p=0.000 n=92+95)
SubVW/2-8         4.15ns ± 1%     5.20ns ± 1%   +25.22%  (p=0.000 n=83+85)
SubVW/3-8         5.50ns ± 2%     6.22ns ± 6%   +13.21%  (p=0.000 n=91+97)
SubVW/4-8         5.99ns ± 1%     6.63ns ± 1%   +10.63%  (p=0.000 n=79+61)
SubVW/5-8         6.75ns ± 4%     6.88ns ± 2%    +1.82%  (p=0.000 n=98+73)
SubVW/10-8        9.57ns ± 1%     9.56ns ± 1%    -0.13%  (p=0.000 n=77+64)
SubVW/100-8       90.3ns ± 1%     18.1ns ± 2%   -80.00%  (p=0.000 n=75+94)
SubVW/1000-8       860ns ± 4%       85ns ± 7%   -90.14%  (p=0.000 n=97+99)
SubVW/10000-8     8.51µs ± 3%     1.77µs ± 6%   -79.21%  (p=0.000 n=100+97)
SubVW/100000-8    84.4µs ± 3%     31.5µs ± 3%   -62.66%  (p=0.000 n=92+92)

Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937
Reviewed-on: https://go-review.googlesource.com/c/go/+/164968
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2019-03-09 20:33:46 +00:00
Josh Bleecher Snyder 4c227a091e math/big: remove bounds checks in pure Go implementations
These routines are quite sensitive to BCE.

This change eliminates bounds checks from loops.
It does so at the cost of a bit of safety:
malformed input will now return incorrect answers
instead of panicking.

This isn't as bad as it sounds: math/big has very good
test coverage, and the alternative implementations are in
assembly, which could do much worse things with malformed input.

If the compiler's BCE improves, so could these routines.

Notable BCE improvements for these routines would be:

* Allowing and propagating more cross-slice length hints.
  Then hints like _ = y[:len(z)] would eliminate bounds checks for y[i].

* Propagating enough information so that we could do
  n := len(x)
  if len(z) < n {
    n = len(z)
  }
  and then have i < n eliminate the same bounds checks as
  i < len(x) && i < len(z) currently does.

* Providing some way to do BCE for unrolled loops.
  Now that we have math/bits implementations,
  it is possible to write things like ADC chains in
  pure Go, if you can reasonably unroll loops.
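
For illustration, the compound loop condition mentioned above
(i < len(x) && i < len(z)) might look like this in a vector add. This is a
sketch with uint in place of Word and a hypothetical addVV, not the code
from this change.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVV sets z = x + y word by word and returns the final carry.
// The loop condition names every slice that is indexed in the body,
// which is the form the message above says the compiler can use to
// eliminate bounds checks.
func addVV(z, x, y []uint) (c uint) {
	for i := 0; i < len(z) && i < len(x) && i < len(y); i++ {
		z[i], c = bits.Add(x[i], y[i], c)
	}
	return c
}

func main() {
	z := make([]uint, 2)
	c := addVV(z, []uint{^uint(0), 1}, []uint{1, 2})
	fmt.Println(z, c)
}
```

The safety trade-off is visible here: if the slices have different
lengths, the loop silently stops at the shortest one instead of panicking.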

Benchmarks below are for amd64, using -tags=math_big_pure_go.

name            old time/op    new time/op    delta
AddVV/1-8         5.15ns ± 3%    4.65ns ± 4%   -9.81%  (p=0.000 n=93+86)
AddVV/2-8         6.40ns ± 2%    5.58ns ± 4%  -12.78%  (p=0.000 n=90+95)
AddVV/3-8         7.07ns ± 2%    6.66ns ± 2%   -5.88%  (p=0.000 n=87+83)
AddVV/4-8         7.94ns ± 5%    7.41ns ± 4%   -6.65%  (p=0.000 n=94+98)
AddVV/5-8         8.55ns ± 1%    8.80ns ± 0%   +2.92%  (p=0.000 n=87+92)
AddVV/10-8        12.7ns ± 1%    12.3ns ± 1%   -3.12%  (p=0.000 n=83+71)
AddVV/100-8        119ns ± 5%     117ns ± 4%   -1.60%  (p=0.000 n=93+90)
AddVV/1000-8      1.14µs ± 4%    1.14µs ± 5%     ~     (p=0.812 n=95+91)
AddVV/10000-8     11.4µs ± 5%    11.3µs ± 5%     ~     (p=0.503 n=97+96)
AddVV/100000-8     114µs ± 4%     113µs ± 5%   -0.98%  (p=0.002 n=97+90)

name            old time/op    new time/op    delta
SubVV/1-8         5.23ns ± 5%    4.65ns ± 3%  -11.18%  (p=0.000 n=89+91)
SubVV/2-8         6.49ns ± 5%    5.58ns ± 3%  -14.04%  (p=0.000 n=92+94)
SubVV/3-8         7.10ns ± 3%    6.65ns ± 2%   -6.28%  (p=0.000 n=87+80)
SubVV/4-8         8.04ns ± 1%    7.44ns ± 5%   -7.49%  (p=0.000 n=83+98)
SubVV/5-8         8.55ns ± 2%    8.32ns ± 1%   -2.75%  (p=0.000 n=84+92)
SubVV/10-8        12.7ns ± 1%    12.3ns ± 1%   -3.09%  (p=0.000 n=80+75)
SubVV/100-8        119ns ± 0%     116ns ± 3%   -1.83%  (p=0.000 n=87+98)
SubVV/1000-8      1.13µs ± 5%    1.13µs ± 3%     ~     (p=0.082 n=96+98)
SubVV/10000-8     11.2µs ± 1%    11.3µs ± 3%   +0.76%  (p=0.000 n=87+97)
SubVV/100000-8     112µs ± 2%     113µs ± 3%   +0.55%  (p=0.000 n=76+88)

name            old time/op    new time/op    delta
AddVW/1-8         4.30ns ± 4%    3.96ns ± 6%  -8.02%  (p=0.000 n=89+97)
AddVW/2-8         5.15ns ± 2%    4.91ns ± 1%  -4.56%  (p=0.000 n=87+80)
AddVW/3-8         5.59ns ± 3%    5.75ns ± 2%  +2.91%  (p=0.000 n=91+88)
AddVW/4-8         6.20ns ± 1%    6.03ns ± 1%  -2.71%  (p=0.000 n=75+90)
AddVW/5-8         6.93ns ± 3%    6.49ns ± 2%  -6.35%  (p=0.000 n=100+82)
AddVW/10-8        10.0ns ± 7%     9.6ns ± 0%  -4.02%  (p=0.000 n=98+74)
AddVW/100-8       91.1ns ± 1%    90.6ns ± 1%  -0.55%  (p=0.000 n=84+80)
AddVW/1000-8       866ns ± 1%     856ns ± 4%  -1.06%  (p=0.000 n=69+96)
AddVW/10000-8     8.64µs ± 1%    8.53µs ± 4%  -1.25%  (p=0.000 n=67+99)
AddVW/100000-8    84.3µs ± 2%    85.4µs ± 4%  +1.22%  (p=0.000 n=89+99)

name            old time/op    new time/op    delta
SubVW/1-8         4.28ns ± 2%    3.82ns ± 3%  -10.63%  (p=0.000 n=91+89)
SubVW/2-8         4.61ns ± 1%    4.48ns ± 3%   -2.67%  (p=0.000 n=94+96)
SubVW/3-8         5.54ns ± 1%    5.81ns ± 4%   +4.87%  (p=0.000 n=92+97)
SubVW/4-8         6.20ns ± 1%    6.08ns ± 2%   -1.99%  (p=0.000 n=71+88)
SubVW/5-8         6.91ns ± 3%    6.64ns ± 1%   -3.90%  (p=0.000 n=97+70)
SubVW/10-8        9.85ns ± 2%    9.62ns ± 0%   -2.31%  (p=0.000 n=82+62)
SubVW/100-8       91.1ns ± 1%    90.9ns ± 3%   -0.14%  (p=0.010 n=71+93)
SubVW/1000-8       859ns ± 3%     867ns ± 1%   +0.98%  (p=0.000 n=99+78)
SubVW/10000-8     8.54µs ± 5%    8.57µs ± 2%   +0.38%  (p=0.007 n=98+92)
SubVW/100000-8    84.5µs ± 3%    84.6µs ± 3%     ~     (p=0.334 n=95+94)

name                old time/op    new time/op    delta
AddMulVVW/1-8         5.43ns ± 3%    4.36ns ± 2%  -19.67%  (p=0.000 n=95+94)
AddMulVVW/2-8         6.56ns ± 4%    6.11ns ± 1%   -6.90%  (p=0.000 n=91+91)
AddMulVVW/3-8         8.00ns ± 1%    7.80ns ± 4%   -2.52%  (p=0.000 n=83+95)
AddMulVVW/4-8         9.81ns ± 2%    9.53ns ± 1%   -2.86%  (p=0.000 n=77+64)
AddMulVVW/5-8         11.4ns ± 3%    11.3ns ± 5%   -0.89%  (p=0.000 n=95+97)
AddMulVVW/10-8        18.9ns ± 5%    19.1ns ± 5%   +0.89%  (p=0.000 n=91+94)
AddMulVVW/100-8        165ns ± 5%     165ns ± 4%     ~     (p=0.427 n=97+98)
AddMulVVW/1000-8      1.56µs ± 3%    1.56µs ± 4%     ~     (p=0.167 n=98+96)
AddMulVVW/10000-8     15.7µs ± 5%    15.6µs ± 5%   -0.31%  (p=0.044 n=95+97)
AddMulVVW/100000-8     156µs ± 3%     157µs ± 8%     ~     (p=0.373 n=72+99)

Change-Id: Ibc720785d5b95f6a797103b1363843205f4d56bf
Reviewed-on: https://go-review.googlesource.com/c/go/+/164966
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
2019-03-09 20:33:13 +00:00
Josh Bleecher Snyder d5edbcac98 math/big: rewrite pure Go implementations to use math/bits
While we're here, delete addWW_g and subWW_g, per the TODO.
They are now obsolete.
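
As a sketch of what a math/bits based kernel can look like (uint in place
of Word; the function bodies are an illustration, not the code from this
change):

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulAddWWW returns x*y + c as a two-word value (hi, lo).
func mulAddWWW(x, y, c uint) (hi, lo uint) {
	hi, lo = bits.Mul(x, y)
	var cc uint
	lo, cc = bits.Add(lo, c, 0)
	return hi + cc, lo
}

// addMulVVW computes z += x*y word by word and returns the final carry.
func addMulVVW(z, x []uint, y uint) (c uint) {
	for i := 0; i < len(z) && i < len(x); i++ {
		hi, lo := mulAddWWW(x[i], y, z[i])
		lo, cc := bits.Add(lo, c, 0)
		c = hi + cc // cannot overflow: x*y + z[i] + c fits in two words
		z[i] = lo
	}
	return c
}

func main() {
	z := []uint{1, 2, 3}
	c := addMulVVW(z, []uint{4, 5, 6}, 10)
	fmt.Println(z, c)
}
```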

Benchmarks on amd64 with -tags=math_big_pure_go.

name                old time/op    new time/op     delta
AddVV/1-8             5.24ns ± 2%     5.12ns ± 1%    -2.11%  (p=0.000 n=82+87)
AddVV/2-8             6.44ns ± 1%     6.33ns ± 2%    -1.82%  (p=0.000 n=77+82)
AddVV/3-8             7.89ns ± 8%     6.97ns ± 4%   -11.71%  (p=0.000 n=100+96)
AddVV/4-8             8.60ns ± 0%     7.72ns ± 4%   -10.24%  (p=0.000 n=90+96)
AddVV/5-8             10.3ns ± 4%      8.5ns ± 1%   -17.02%  (p=0.000 n=96+91)
AddVV/10-8            16.2ns ± 5%     12.8ns ± 1%   -21.11%  (p=0.000 n=97+86)
AddVV/100-8            148ns ± 1%      117ns ± 5%   -21.07%  (p=0.000 n=66+98)
AddVV/1000-8          1.41µs ± 4%     1.13µs ± 3%   -19.90%  (p=0.000 n=97+97)
AddVV/10000-8         14.2µs ± 5%     11.2µs ± 1%   -20.82%  (p=0.000 n=99+84)
AddVV/100000-8         142µs ± 4%      113µs ± 4%   -20.40%  (p=0.000 n=91+92)
SubVV/1-8             5.29ns ± 1%     5.11ns ± 0%    -3.30%  (p=0.000 n=87+88)
SubVV/2-8             6.36ns ± 4%     6.33ns ± 2%    -0.56%  (p=0.002 n=98+73)
SubVV/3-8             7.58ns ± 5%     6.98ns ± 4%    -8.01%  (p=0.000 n=97+91)
SubVV/4-8             8.61ns ± 3%     7.98ns ± 2%    -7.31%  (p=0.000 n=95+83)
SubVV/5-8             10.6ns ± 2%      8.5ns ± 1%   -19.56%  (p=0.000 n=79+89)
SubVV/10-8            16.3ns ± 4%     12.7ns ± 1%   -21.97%  (p=0.000 n=98+82)
SubVV/100-8            124ns ± 1%      118ns ± 1%    -4.83%  (p=0.000 n=85+81)
SubVV/1000-8          1.14µs ± 5%     1.12µs ± 2%    -1.17%  (p=0.000 n=97+81)
SubVV/10000-8         11.6µs ±10%     11.2µs ± 1%    -3.39%  (p=0.000 n=100+84)
SubVV/100000-8         114µs ± 6%      114µs ± 5%      ~     (p=0.396 n=83+94)
AddVW/1-8             4.04ns ± 4%     4.34ns ± 4%    +7.57%  (p=0.000 n=96+98)
AddVW/2-8             4.34ns ± 5%     4.40ns ± 5%    +1.40%  (p=0.000 n=99+98)
AddVW/3-8             5.43ns ± 0%     5.54ns ± 2%    +1.97%  (p=0.000 n=85+94)
AddVW/4-8             6.23ns ± 1%     6.18ns ± 2%    -0.66%  (p=0.000 n=77+78)
AddVW/5-8             6.78ns ± 2%     6.90ns ± 4%    +1.77%  (p=0.000 n=80+99)
AddVW/10-8            10.5ns ± 4%      9.9ns ± 1%    -5.77%  (p=0.000 n=97+69)
AddVW/100-8            114ns ± 3%       91ns ± 0%   -20.38%  (p=0.000 n=98+77)
AddVW/1000-8          1.12µs ± 1%     0.87µs ± 1%   -22.80%  (p=0.000 n=82+68)
AddVW/10000-8         11.2µs ± 2%      8.5µs ± 5%   -23.85%  (p=0.000 n=85+100)
AddVW/100000-8         112µs ± 2%       85µs ± 5%   -24.22%  (p=0.000 n=71+96)
SubVW/1-8             4.09ns ± 2%     4.18ns ± 4%    +2.32%  (p=0.000 n=78+96)
SubVW/2-8             4.59ns ± 5%     4.52ns ± 7%    -1.54%  (p=0.000 n=98+94)
SubVW/3-8             5.41ns ±10%     5.55ns ± 1%    +2.48%  (p=0.000 n=100+89)
SubVW/4-8             6.51ns ± 2%     6.19ns ± 0%    -4.85%  (p=0.000 n=97+81)
SubVW/5-8             7.25ns ± 3%     6.90ns ± 4%    -4.93%  (p=0.000 n=97+96)
SubVW/10-8            10.6ns ± 4%      9.8ns ± 2%    -7.32%  (p=0.000 n=95+96)
SubVW/100-8           90.4ns ± 0%     90.8ns ± 0%    +0.43%  (p=0.000 n=83+78)
SubVW/1000-8           853ns ± 4%      857ns ± 2%    +0.42%  (p=0.000 n=100+98)
SubVW/10000-8         8.52µs ± 4%     8.53µs ± 2%      ~     (p=0.061 n=99+97)
SubVW/100000-8        84.8µs ± 5%     84.2µs ± 2%    -0.78%  (p=0.000 n=99+93)
AddMulVVW/1-8         8.73ns ± 0%     5.33ns ± 3%   -38.91%  (p=0.000 n=91+96)
AddMulVVW/2-8         14.8ns ± 3%      6.5ns ± 2%   -56.33%  (p=0.000 n=100+79)
AddMulVVW/3-8         18.6ns ± 2%      7.8ns ± 5%   -57.84%  (p=0.000 n=89+96)
AddMulVVW/4-8         24.0ns ± 2%      9.8ns ± 0%   -59.09%  (p=0.000 n=95+67)
AddMulVVW/5-8         29.0ns ± 2%     11.5ns ± 5%   -60.44%  (p=0.000 n=90+97)
AddMulVVW/10-8        54.1ns ± 0%     18.8ns ± 1%   -65.37%  (p=0.000 n=82+84)
AddMulVVW/100-8        508ns ± 2%      165ns ± 4%   -67.62%  (p=0.000 n=72+98)
AddMulVVW/1000-8      4.96µs ± 3%     1.55µs ± 1%   -68.86%  (p=0.000 n=99+91)
AddMulVVW/10000-8     50.0µs ± 4%     15.5µs ± 4%   -68.95%  (p=0.000 n=97+97)
AddMulVVW/100000-8     491µs ± 1%      156µs ± 8%   -68.22%  (p=0.000 n=79+95)

Change-Id: I4c6ae0b4065f371aea8103f6a85d9e9274bf01d0
Reviewed-on: https://go-review.googlesource.com/c/go/+/164965
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2019-03-04 20:49:12 +00:00
Josh Bleecher Snyder 87cc56718a math/big: optimize shlVU_g and shrVU_g
Special case shifts by zero.
Provide hints to the compiler that shifts are bounded.
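
A sketch of both ideas for a left shift, with uint in place of Word and a
hypothetical shlVU; the masking with the word size minus one is the kind of
hint meant above.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shlVU sets z = x << s across words and returns the bits shifted out
// of the top word. Assumes len(x) >= len(z).
func shlVU(z, x []uint, s uint) (c uint) {
	if s == 0 {
		// Special case: a shift by zero is a plain copy.
		copy(z, x)
		return 0
	}
	if len(z) == 0 {
		return 0
	}
	s &= bits.UintSize - 1 // tells the compiler that s is below the word size
	t := bits.UintSize - s
	t &= bits.UintSize - 1 // same for the complementary shift
	c = x[len(z)-1] >> t
	for i := len(z) - 1; i > 0; i-- {
		z[i] = x[i]<<s | x[i-1]>>t
	}
	z[0] = x[0] << s
	return c
}

func main() {
	x := []uint{3, 1}
	z := make([]uint, len(x))
	c := shlVU(z, x, 4)
	fmt.Println(z, c)
}
```

Without the masks, Go's shift semantics for amounts at or above the word
size would require the compiler to emit extra guard code around each shift.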

There are no existing benchmarks for shifts,
but the Float implementation uses shifts,
so we can use those.

Benchmarks on amd64 with -tags=math_big_pure_go.

name                  old time/op    new time/op    delta
FloatString/100-8        869ns ± 3%     872ns ± 4%   +0.40%  (p=0.001 n=94+83)
FloatString/1000-8      26.5µs ± 1%    26.4µs ± 1%   -0.46%  (p=0.000 n=87+96)
FloatString/10000-8     2.18ms ± 2%    2.18ms ± 2%     ~     (p=0.687 n=90+89)
FloatString/100000-8     200ms ± 7%     197ms ± 5%   -1.47%  (p=0.000 n=100+90)
FloatAdd/10-8           65.9ns ± 4%    64.0ns ± 4%   -2.94%  (p=0.000 n=92+93)
FloatAdd/100-8          71.3ns ± 4%    67.4ns ± 4%   -5.51%  (p=0.000 n=96+93)
FloatAdd/1000-8          128ns ± 1%     121ns ± 0%   -5.69%  (p=0.000 n=91+80)
FloatAdd/10000-8         718ns ± 4%     626ns ± 4%  -12.83%  (p=0.000 n=99+99)
FloatAdd/100000-8       6.43µs ± 3%    5.50µs ± 1%  -14.50%  (p=0.000 n=98+83)
FloatSub/10-8           57.7ns ± 2%    57.0ns ± 4%   -1.20%  (p=0.000 n=89+96)
FloatSub/100-8          59.9ns ± 3%    58.7ns ± 4%   -2.10%  (p=0.000 n=100+98)
FloatSub/1000-8         94.5ns ± 1%    88.6ns ± 0%   -6.16%  (p=0.000 n=74+70)
FloatSub/10000-8         456ns ± 1%     416ns ± 5%   -8.83%  (p=0.000 n=87+95)
FloatSub/100000-8       4.00µs ± 1%    3.57µs ± 1%  -10.87%  (p=0.000 n=68+85)
FloatSqrt/64-8           585ns ± 1%     579ns ± 1%   -0.99%  (p=0.000 n=92+90)
FloatSqrt/128-8         1.26µs ± 1%    1.23µs ± 2%   -2.42%  (p=0.000 n=91+81)
FloatSqrt/256-8         1.45µs ± 3%    1.40µs ± 1%   -3.61%  (p=0.000 n=96+90)
FloatSqrt/1000-8        4.03µs ± 1%    3.91µs ± 1%   -3.05%  (p=0.000 n=90+93)
FloatSqrt/10000-8       48.0µs ± 0%    47.3µs ± 1%   -1.55%  (p=0.000 n=90+90)
FloatSqrt/100000-8      1.23ms ± 3%    1.22ms ± 4%   -1.00%  (p=0.000 n=99+99)
FloatSqrt/1000000-8     96.7ms ± 4%    98.0ms ±10%     ~     (p=0.322 n=89+99)

Change-Id: I0f941c05b7c324256d7f0674559b6ba906e92ba8
Reviewed-on: https://go-review.googlesource.com/c/go/+/164967
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2019-03-04 19:30:57 +00:00
hearot f28191340e math/big: fix a formula used as documentation
The function documentation was wrong: it used the wrong parameter name.
This change replaces it with the right one.

The wrong formula was: q = (u1<<_W + u0 - r)/y
The function has got a parameter "v" (of type Word), not a parameter "y".
So, the right formula is: q = (u1<<_W + u0 - r)/v
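
The corrected formula can be checked numerically. The sketch below uses
bits.Div64 from math/bits as a stand-in for the documented function, with
64-bit words; the helper name divIdentityHolds is hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// divIdentityHolds reports whether q*v + r == u1<<64 + u0 with r < v,
// which is the formula q = (u1<<_W + u0 - r)/v rearranged.
// Requires v != 0 and u1 < v, as bits.Div64 does.
func divIdentityHolds(u1, u0, v uint64) bool {
	q, r := bits.Div64(u1, u0, v)
	hi, lo := bits.Mul64(q, v)
	lo, c := bits.Add64(lo, r, 0)
	hi += c
	return hi == u1 && lo == u0 && r < v
}

func main() {
	fmt.Println(divIdentityHolds(1, 0, 3))
}
```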

Fixes #28444

Change-Id: I82e57ba014735a9fdb6262874ddf498754d30d33
Reviewed-on: https://go-review.googlesource.com/c/145280
Reviewed-by: Robert Griesemer <gri@golang.org>
2018-10-28 16:58:20 +00:00
Robert Griesemer 70ea0ec30f math/big: replace local versions of bitLen, nlz with math/bits versions
Verified that BenchmarkBitLen time went down from 2.25 ns/op to 0.65 ns/op
on a 2.3 GHz Intel Core i7, before removing that benchmark (now covered by
math/bits benchmarks).
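
For reference, thin wrappers over math/bits of the kind this replacement
implies could look like the sketch below. The wrapper names follow the
commit title, but their signatures here are assumptions.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nlz returns the number of leading zero bits in x.
func nlz(x uint) uint {
	return uint(bits.LeadingZeros(x))
}

// bitLen returns the number of bits needed to represent x;
// it is 0 for x == 0.
func bitLen(x uint) int {
	return bits.Len(x)
}

func main() {
	fmt.Println(nlz(1), bitLen(255))
}
```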

Change-Id: I3890bb7d1889e95b9a94bd68f0bdf06f1885adeb
Reviewed-on: https://go-review.googlesource.com/38464
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-03-23 19:43:09 +00:00
Robert Griesemer 322fff8ac8 math/big: use math/bits where appropriate
This change adds math/bits as a new dependency of math/big.

- use bits.LeadingZeros instead of local implementation
  (they are identical, so there's no performance loss here)

- leave other functionality local (ntz, bitLen) since there are
  faster implementations in math/big at the moment

Change-Id: I1218aa8a1df0cc9783583b090a4bb5a8a145c4a2
Reviewed-on: https://go-review.googlesource.com/37141
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-02-24 19:19:02 +00:00
Robert Griesemer 174058038c math/big: define Word as uint instead of uintptr
For compatibility with math/bits uint operations.

When math/big was written originally, the Go compiler used 32bit
int/uint values even on a 64bit machine. uintptr was the type that
represented the machine register size. Now, the int/uint types are
sized to the native machine register size, so they are the natural
machine Word type.

On most machines, the size of int/uint corresponds to the size of
uintptr. On platforms where uint and uintptr have different sizes,
this change may lead to performance differences (e.g., amd64p32).
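
A small sketch of why a uint-based Word fits math/bits: the conversions
to and from uint are between types of the same size, and the word size is
available as bits.UintSize. The helper addWords is hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Word is defined as uint, matching the operand type of the
// math/bits uint functions.
type Word uint

// _W is the word size in bits.
const _W = bits.UintSize

// addWords returns x + y + c as (carry, sum); c must be 0 or 1.
func addWords(x, y, c Word) (z1, z0 Word) {
	sum, carry := bits.Add(uint(x), uint(y), uint(c))
	return Word(carry), Word(sum)
}

func main() {
	z1, z0 := addWords(^Word(0), 1, 0)
	fmt.Println(_W, z1, z0)
}
```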

Change-Id: Ief249c160b707b6441848f20041e32e9e9d8d8ca
Reviewed-on: https://go-review.googlesource.com/37372
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-02-21 19:31:40 +00:00
Robert Griesemer 635cd91eb4 math/big: more cleanups (msbxx, nlzxx functions)
Change-Id: Ibace718452b6dc029c5af5240117f5fc794c38cf
Reviewed-on: https://go-review.googlesource.com/10388
Reviewed-by: Alan Donovan <adonovan@google.com>
2015-05-27 22:10:15 +00:00
Josh Bleecher Snyder 91f2db3c57 math/big: test that subVW and addVW work with arbitrary y
Fixes #10525.

Change-Id: I92dc87f5d6db396d8dde2220fc37b7093b772d81
Reviewed-on: https://go-review.googlesource.com/9210
Reviewed-by: Robert Griesemer <gri@golang.org>
2015-04-21 23:13:33 +00:00
Robert Griesemer 067acd51b0 math/big: faster "pure Go" addition/subtraction for long vectors
(platforms w/o corresponding assembly kernels)

For short vector adds there's some erratic slow-down, but overall
these routines have become significantly faster. This only matters
for platforms w/o native (assembly) versions of these kernels, so
we are not concerned about the minor slow-down for short vectors.

This code was already reviewed under Mercurial (golang.org/cl/172810043)
but wasn't submitted before the switch to git.

Benchmarks run on 2.3GHz Intel Core i7, running OS X 10.9.5,
with the respective AddVV and AddVW assembly routines disabled.

benchmark              old ns/op     new ns/op     delta
BenchmarkAddVV_1       6.59          7.09          +7.59%
BenchmarkAddVV_2       10.3          10.1          -1.94%
BenchmarkAddVV_3       10.9          12.6          +15.60%
BenchmarkAddVV_4       13.9          15.6          +12.23%
BenchmarkAddVV_5       16.8          17.3          +2.98%
BenchmarkAddVV_1e1     29.5          29.9          +1.36%
BenchmarkAddVV_1e2     246           232           -5.69%
BenchmarkAddVV_1e3     2374          2185          -7.96%
BenchmarkAddVV_1e4     58942         22292         -62.18%
BenchmarkAddVV_1e5     668622        225279        -66.31%
BenchmarkAddVW_1       6.81          5.58          -18.06%
BenchmarkAddVW_2       7.69          6.86          -10.79%
BenchmarkAddVW_3       9.56          8.32          -12.97%
BenchmarkAddVW_4       12.1          9.53          -21.24%
BenchmarkAddVW_5       13.2          10.9          -17.42%
BenchmarkAddVW_1e1     23.4          18.0          -23.08%
BenchmarkAddVW_1e2     175           141           -19.43%
BenchmarkAddVW_1e3     1568          1266          -19.26%
BenchmarkAddVW_1e4     15425         12596         -18.34%
BenchmarkAddVW_1e5     156737        133539        -14.80%
BenchmarkFibo          381678466     132958666     -65.16%

benchmark              old MB/s     new MB/s     speedup
BenchmarkAddVV_1       9715.25      9028.30      0.93x
BenchmarkAddVV_2       12461.72     12622.60     1.01x
BenchmarkAddVV_3       17549.64     15243.82     0.87x
BenchmarkAddVV_4       18392.54     16398.29     0.89x
BenchmarkAddVV_5       18995.23     18496.57     0.97x
BenchmarkAddVV_1e1     21708.98     21438.28     0.99x
BenchmarkAddVV_1e2     25956.53     27506.88     1.06x
BenchmarkAddVV_1e3     26947.93     29286.66     1.09x
BenchmarkAddVV_1e4     10857.96     28709.46     2.64x
BenchmarkAddVV_1e5     9571.91      28409.21     2.97x
BenchmarkAddVW_1       1175.28      1433.98      1.22x
BenchmarkAddVW_2       2080.01      2332.54      1.12x
BenchmarkAddVW_3       2509.28      2883.97      1.15x
BenchmarkAddVW_4       2646.09      3356.83      1.27x
BenchmarkAddVW_5       3020.69      3671.07      1.22x
BenchmarkAddVW_1e1     3425.76      4441.40      1.30x
BenchmarkAddVW_1e2     4553.17      5642.96      1.24x
BenchmarkAddVW_1e3     5100.14      6318.72      1.24x
BenchmarkAddVW_1e4     5186.15      6350.96      1.22x
BenchmarkAddVW_1e5     5104.07      5990.74      1.17x

Change-Id: I7a62023b1105248a0e85e5b9819d3fd4266123d4
Reviewed-on: https://go-review.googlesource.com/2480
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Alan Donovan <adonovan@google.com>
2015-01-08 17:00:59 +00:00
Russ Cox c007ce824d build: move package sources from src/pkg to src
Preparation was in CL 134570043.
This CL contains only the effect of 'hg mv src/pkg/* src'.
For more about the move, see golang.org/s/go14nopkg.
2014-09-08 00:08:51 -04:00