Commit Graph

9 Commits

Author SHA1 Message Date
Martin Möhrmann 69972aea74 internal/cpu: new package to detect cpu features
Implements detection of x86 cpu features that
are used in the go standard library.

Changes all standard library packages to use the new cpu package
instead of using runtime internal variables to check x86 cpu features.

Updates: #15403

Change-Id: I2999a10cb4d9ec4863ffbed72f4e021a1dbc4bb9
Reviewed-on: https://go-review.googlesource.com/41476
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-05-10 17:02:21 +00:00
Josselin Costanzi d206af1e6c strings: optimize Count for amd64
Move optimized Count implementation from bytes to runtime. Use in
both bytes and strings packages.
Add CountByte benchmark to strings.

Strings benchmarks:
name                       old time/op    new time/op    delta
CountHard1-4                 226µs ± 1%      226µs ± 2%      ~     (p=0.247 n=10+10)
CountHard2-4                 316µs ± 1%      315µs ± 0%      ~     (p=0.133 n=9+10)
CountHard3-4                 919µs ± 1%      920µs ± 1%      ~     (p=0.968 n=10+9)
CountTorture-4              15.4µs ± 1%     15.7µs ± 1%    +2.47%  (p=0.000 n=10+9)
CountTortureOverlapping-4   9.60ms ± 0%     9.65ms ± 1%      ~     (p=0.247 n=10+10)
CountByte/10-4              26.3ns ± 1%     10.9ns ± 1%   -58.71%  (p=0.000 n=9+9)
CountByte/32-4              42.7ns ± 0%     14.2ns ± 0%   -66.64%  (p=0.000 n=10+10)
CountByte/4096-4            3.07µs ± 0%     0.31µs ± 2%   -89.99%  (p=0.000 n=9+10)
CountByte/4194304-4         3.48ms ± 1%     0.34ms ± 1%   -90.09%  (p=0.000 n=10+9)
CountByte/67108864-4        55.6ms ± 1%      7.0ms ± 0%   -87.49%  (p=0.000 n=9+8)

name                      old speed      new speed       delta
CountByte/10-4             380MB/s ± 1%    919MB/s ± 1%  +142.21%  (p=0.000 n=9+9)
CountByte/32-4             750MB/s ± 0%   2247MB/s ± 0%  +199.62%  (p=0.000 n=10+10)
CountByte/4096-4          1.33GB/s ± 0%  13.32GB/s ± 2%  +898.13%  (p=0.000 n=9+10)
CountByte/4194304-4       1.21GB/s ± 1%  12.17GB/s ± 1%  +908.87%  (p=0.000 n=10+9)
CountByte/67108864-4      1.21GB/s ± 1%   9.65GB/s ± 0%  +699.29%  (p=0.000 n=9+8)

Fixes #19411

Change-Id: I8d2d409f0fa6df6d03b60790aa86e540b4a4e3b0
Reviewed-on: https://go-review.googlesource.com/38693
Reviewed-by: Keith Randall <khr@golang.org>
2017-04-07 14:25:13 +00:00
Josselin Costanzi 0d3cd51c9c bytes: fix typo in comment
Change-Id: Ia739337dc9961422982912cc6a669022559fb991
Reviewed-on: https://go-review.googlesource.com/38365
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-03-22 19:41:54 +00:00
Josselin Costanzi 01cd22c687 bytes: add optimized countByte for amd64
Use SSE/AVX2 when counting a single byte.
Inspired from runtime indexbyte implementation.

Benchmark against previous implementation, where
1 byte in every 8 is the one we are looking for:

* On a machine without AVX2
name               old time/op   new time/op     delta
CountSingle/10-4    61.8ns ±10%     15.6ns ±11%    -74.83%  (p=0.000 n=10+10)
CountSingle/32-4     100ns ± 4%       17ns ±10%    -82.54%  (p=0.000 n=10+9)
CountSingle/4K-4    9.66µs ± 3%     0.37µs ± 6%    -96.21%  (p=0.000 n=10+10)
CountSingle/4M-4    11.0ms ± 6%      0.4ms ± 4%    -96.04%  (p=0.000 n=10+10)
CountSingle/64M-4    194ms ± 8%        8ms ± 2%    -95.64%  (p=0.000 n=10+10)

name               old speed     new speed       delta
CountSingle/10-4   162MB/s ±10%    645MB/s ±10%   +297.00%  (p=0.000 n=10+10)
CountSingle/32-4   321MB/s ± 5%   1844MB/s ± 9%   +474.79%  (p=0.000 n=10+9)
CountSingle/4K-4   424MB/s ± 3%  11169MB/s ± 6%  +2533.10%  (p=0.000 n=10+10)
CountSingle/4M-4   381MB/s ± 7%   9609MB/s ± 4%  +2421.88%  (p=0.000 n=10+10)
CountSingle/64M-4  346MB/s ± 7%   7924MB/s ± 2%  +2188.78%  (p=0.000 n=10+10)

* On a machine with AVX2
name               old time/op   new time/op     delta
CountSingle/10-8    37.1ns ± 3%      8.2ns ± 1%    -77.80%  (p=0.000 n=10+10)
CountSingle/32-8    66.1ns ± 3%      9.8ns ± 2%    -85.23%  (p=0.000 n=10+10)
CountSingle/4K-8    7.36µs ± 3%     0.11µs ± 1%    -98.54%  (p=0.000 n=10+10)
CountSingle/4M-8    7.46ms ± 2%     0.15ms ± 2%    -97.95%  (p=0.000 n=10+9)
CountSingle/64M-8    124ms ± 2%        6ms ± 4%    -95.09%  (p=0.000 n=10+10)

name               old speed     new speed       delta
CountSingle/10-8   269MB/s ± 3%   1213MB/s ± 1%   +350.32%  (p=0.000 n=10+10)
CountSingle/32-8   484MB/s ± 4%   3277MB/s ± 2%   +576.66%  (p=0.000 n=10+10)
CountSingle/4K-8   556MB/s ± 3%  37933MB/s ± 1%  +6718.36%  (p=0.000 n=10+10)
CountSingle/4M-8   562MB/s ± 2%  27444MB/s ± 3%  +4783.43%  (p=0.000 n=10+9)
CountSingle/64M-8  543MB/s ± 2%  11054MB/s ± 3%  +1935.81%  (p=0.000 n=10+10)

Fixes #19411

Change-Id: Ieaf20b1fabccabe767c55c66e242e86f3617f883
Reviewed-on: https://go-review.googlesource.com/38258
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-03-21 20:25:17 +00:00
Michael Munday cfd89164bb all: make copyright headers consistent with one space after period
Continuation of CL 20111.

Change-Id: Ie2f62237e6ec316989c021de9b267cc9d6ee6676
Reviewed-on: https://go-review.googlesource.com/32830
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-11-04 20:46:25 +00:00
Ilya Tocar f31492ffe7 bytes,strings: use IndexByte more often in Index on AMD64
IndexByte+compare is faster than indexShortStr in good case, when
first byte is rare, but is more costly in bad cases.
Start with IndexByte and switch to indexShortStr if we encounter
false positives more often than once per 8 bytes.

Benchmark changes for package bytes:

IndexRune/4K-8                    416ns ± 0%       86ns ± 0%    -79.24%        (p=0.000 n=10+10)
IndexRune/4M-8                    413µs ± 0%      100µs ± 1%    -75.88%        (p=0.000 n=10+10)
IndexRune/64M-8                  6.73ms ± 0%     2.86ms ± 1%    -57.49%        (p=0.000 n=10+10)
Index/10-8                       8.45ns ± 0%     8.96ns ± 0%     +6.04%         (p=0.000 n=9+10)
Index/32-8                       9.64ns ± 0%     9.51ns ± 0%     -1.30%          (p=0.000 n=8+9)
Index/4K-8                       2.11µs ± 0%     2.12µs ± 0%     +0.26%        (p=0.000 n=10+10)
Index/4M-8                       3.60ms ± 5%     3.59ms ± 7%       ~            (p=0.497 n=9+10)
Index/64M-8                      57.1ms ± 3%     58.7ms ± 5%       ~            (p=0.113 n=9+10)
IndexEasy/10-8                   7.10ns ± 1%     7.71ns ± 1%     +8.60%        (p=0.000 n=10+10)
IndexEasy/32-8                   9.29ns ± 1%     9.22ns ± 0%     -0.75%         (p=0.000 n=9+10)
IndexEasy/4K-8                   1.06µs ± 0%     0.08µs ± 0%    -92.18%        (p=0.000 n=10+10)
IndexEasy/4M-8                   1.07ms ± 0%     0.10ms ± 1%    -90.74%         (p=0.000 n=9+10)
IndexEasy/64M-8                  17.3ms ± 0%      2.8ms ± 1%    -83.76%         (p=0.000 n=10+9)

IndexRune/4K-8                 9.84GB/s ± 0%  47.42GB/s ± 0%   +381.85%         (p=0.000 n=8+10)
IndexRune/4M-8                 10.1GB/s ± 0%   42.1GB/s ± 1%   +314.56%        (p=0.000 n=10+10)
IndexRune/64M-8                10.0GB/s ± 0%   23.4GB/s ± 1%   +135.25%        (p=0.000 n=10+10)
Index/10-8                     1.18GB/s ± 0%   1.12GB/s ± 0%     -5.67%         (p=0.000 n=10+9)
Index/32-8                     3.32GB/s ± 0%   3.36GB/s ± 0%     +1.27%         (p=0.000 n=10+9)
Index/4K-8                     1.94GB/s ± 0%   1.93GB/s ± 0%     -0.25%         (p=0.000 n=10+9)
Index/4M-8                     1.17GB/s ± 5%   1.17GB/s ± 7%       ~            (p=0.497 n=9+10)
Index/64M-8                    1.17GB/s ± 3%   1.15GB/s ± 6%       ~            (p=0.113 n=9+10)
IndexEasy/10-8                 1.41GB/s ± 1%   1.30GB/s ± 1%     -7.90%        (p=0.000 n=10+10)
IndexEasy/32-8                 3.45GB/s ± 1%   3.47GB/s ± 0%     +0.73%         (p=0.000 n=9+10)
IndexEasy/4K-8                 3.84GB/s ± 0%  49.16GB/s ± 0%  +1178.78%         (p=0.000 n=9+10)
IndexEasy/4M-8                 3.91GB/s ± 0%  42.19GB/s ± 1%   +980.37%         (p=0.000 n=9+10)
IndexEasy/64M-8                3.88GB/s ± 0%  23.91GB/s ± 1%   +515.76%         (p=0.000 n=10+9)

No significant changes in strings.

In regexp I see:

Match/Easy0/32-8                 536MB/s ± 1%   540MB/s ± 1%    +0.75%         (p=0.001 n=9+10)
Match/Easy0/1K-8                1.62GB/s ± 0%  4.42GB/s ± 1%  +172.48%        (p=0.000 n=10+10)
Match/Easy0/32K-8               1.87GB/s ± 0%  9.07GB/s ± 1%  +384.24%         (p=0.000 n=7+10)
Match/Easy0/1M-8                1.90GB/s ± 0%  4.83GB/s ± 0%  +154.56%         (p=0.000 n=8+10)
Match/Easy0/32M-8               1.90GB/s ± 0%  4.53GB/s ± 0%  +138.62%         (p=0.000 n=7+10)

Compared to in 1.7:

Match/Easy0/32-8                  59.5ns ± 0%    59.2ns ± 1%   -0.45%         (p=0.008 n=9+10)
Match/Easy0/1K-8                   226ns ± 1%     231ns ± 1%   +2.30%        (p=0.000 n=10+10)
Match/Easy0/32K-8                 3.73µs ± 2%    3.61µs ± 1%   -3.12%        (p=0.000 n=10+10)
Match/Easy0/1M-8                   206µs ± 1%     217µs ± 0%   +5.34%        (p=0.000 n=10+10)
Match/Easy0/32M-8                 7.03ms ± 1%    7.40ms ± 0%   +5.23%        (p=0.000 n=10+10)

Fixes #17456

Change-Id: I38b2fabcaed7119cc4bf37007ba7bfe7504c8f9f
Reviewed-on: https://go-review.googlesource.com/31690
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
Reviewed-by: Keith Randall <khr@golang.org>
2016-11-01 18:30:52 +00:00
Ilya Tocar 0cff219c12 strings: use AVX2 for Index if available
IndexHard4-4      1.50ms ± 2%  0.71ms ± 0%  -52.36%  (p=0.000 n=20+19)

This also fixes a bug, that caused a string of length 16 to use
two 8-byte comparisons instead of one 16-byte. And adds a test for
cases when partial_match fails.

Change-Id: I1ee8fc4e068bb36c95c45de78f067c822c0d9df0
Reviewed-on: https://go-review.googlesource.com/22551
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-09-07 10:43:13 +00:00
Hiroshi Ioka e10286aeda bytes: make IndexRune faster
re-implement IndexRune by IndexByte and Index which are well optimized
to get performance gain.

name                  old time/op   new time/op     delta
IndexRune/10-4         53.2ns ± 1%     29.1ns ± 1%    -45.32%  (p=0.008 n=5+5)
IndexRune/32-4          191ns ± 1%       27ns ± 1%    -85.75%  (p=0.008 n=5+5)
IndexRune/4K-4         23.5µs ± 1%      1.0µs ± 1%    -95.77%  (p=0.008 n=5+5)
IndexRune/4M-4         23.8ms ± 0%      1.0ms ± 2%    -95.90%  (p=0.008 n=5+5)
IndexRune/64M-4         384ms ± 1%       15ms ± 1%    -95.98%  (p=0.008 n=5+5)
IndexRuneASCII/10-4    61.5ns ± 0%     10.3ns ± 4%    -83.17%  (p=0.008 n=5+5)
IndexRuneASCII/32-4     203ns ± 0%       11ns ± 5%    -94.68%  (p=0.008 n=5+5)
IndexRuneASCII/4K-4    23.4µs ± 0%      0.3µs ± 2%    -98.60%  (p=0.008 n=5+5)
IndexRuneASCII/4M-4    24.0ms ± 1%      0.3ms ± 1%    -98.60%  (p=0.008 n=5+5)
IndexRuneASCII/64M-4    386ms ± 2%        6ms ± 1%    -98.57%  (p=0.008 n=5+5)

name                  old speed     new speed       delta
IndexRune/10-4        188MB/s ± 1%    344MB/s ± 1%    +82.91%  (p=0.008 n=5+5)
IndexRune/32-4        167MB/s ± 0%   1175MB/s ± 1%   +603.52%  (p=0.008 n=5+5)
IndexRune/4K-4        174MB/s ± 1%   4117MB/s ± 1%  +2262.71%  (p=0.008 n=5+5)
IndexRune/4M-4        176MB/s ± 0%   4299MB/s ± 2%  +2340.46%  (p=0.008 n=5+5)
IndexRune/64M-4       175MB/s ± 1%   4354MB/s ± 1%  +2388.57%  (p=0.008 n=5+5)
IndexRuneASCII/10-4   163MB/s ± 0%    968MB/s ± 4%   +494.66%  (p=0.008 n=5+5)
IndexRuneASCII/32-4   157MB/s ± 0%   2974MB/s ± 4%  +1788.59%  (p=0.008 n=5+5)
IndexRuneASCII/4K-4   175MB/s ± 0%  12481MB/s ± 2%  +7027.71%  (p=0.008 n=5+5)
IndexRuneASCII/4M-4   175MB/s ± 1%  12510MB/s ± 1%  +7061.15%  (p=0.008 n=5+5)
IndexRuneASCII/64M-4  174MB/s ± 2%  12143MB/s ± 1%  +6881.70%  (p=0.008 n=5+5)

Change-Id: I0632eadb83937c2a9daa7f0ce79df1dee64f992e
Reviewed-on: https://go-review.googlesource.com/28537
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-09-06 23:32:57 +00:00
Ilya Tocar 44f1854c9d bytes: Use the same algorithm as strings for Index
name                     old time/op    new time/op      delta
IndexByte32-48             9.05ns ± 7%      9.59ns ±11%     +5.93%  (p=0.001 n=19+20)
IndexByte4K-48              118ns ± 4%       122ns ± 8%     +3.52%  (p=0.002 n=19+19)
IndexByte4M-48              172µs ±13%       188µs ±12%     +9.49%  (p=0.000 n=20+20)
IndexByte64M-48            8.00ms ±14%      8.05ms ±23%       ~     (p=0.799 n=20+20)
IndexBytePortable32-48     41.7ns ±15%      42.5ns ±12%       ~     (p=0.372 n=20+20)
IndexBytePortable4K-48     3.08µs ±16%      3.26µs ±10%     +5.77%  (p=0.018 n=20+20)
IndexBytePortable4M-48     3.12ms ±17%      3.20ms ±10%       ~     (p=0.157 n=20+20)
IndexBytePortable64M-48    54.0ms ±14%      55.3ms ±14%       ~     (p=0.640 n=20+20)
Index32-48                  230ns ±12%        46ns ± 6%    -79.87%  (p=0.000 n=20+19)
Index4K-48                 43.2µs ± 9%       3.2µs ±12%    -92.58%  (p=0.000 n=19+20)
Index4M-48                 44.4ms ± 7%       3.3ms ±13%    -92.59%  (p=0.000 n=19+20)
Index64M-48                 714ms ±10%        56ms ± 8%    -92.22%  (p=0.000 n=19+19)
IndexEasy32-48             52.7ns ±10%      31.0ns ±11%    -41.21%  (p=0.000 n=20+20)
IndexEasy4K-48              139ns ± 5%      1598ns ± 6%  +1046.37%  (p=0.000 n=19+19)
IndexEasy4M-48              179µs ± 8%      1674µs ±10%   +834.31%  (p=0.000 n=19+20)
IndexEasy64M-48            8.56ms ±10%     27.82ms ±16%   +225.14%  (p=0.000 n=19+20)

name                     old speed      new speed        delta
IndexByte32-48           3.52GB/s ± 7%    3.35GB/s ±11%     -4.99%  (p=0.001 n=20+20)
IndexByte4K-48           34.5GB/s ± 7%    33.2GB/s ±10%     -3.67%  (p=0.002 n=20+20)
IndexByte4M-48           24.6GB/s ±14%    22.4GB/s ±14%     -8.73%  (p=0.000 n=20+20)
IndexByte64M-48          8.42GB/s ±16%    8.42GB/s ±19%       ~     (p=0.799 n=20+20)
IndexBytePortable32-48    770MB/s ±13%     756MB/s ±11%       ~     (p=0.383 n=20+20)
IndexBytePortable4K-48   1.34GB/s ±14%    1.26GB/s ±10%     -5.76%  (p=0.018 n=20+20)
IndexBytePortable4M-48   1.35GB/s ±15%    1.31GB/s ±11%       ~     (p=0.157 n=20+20)
IndexBytePortable64M-48  1.25GB/s ±16%    1.22GB/s ±13%       ~     (p=0.640 n=20+20)
Index32-48                138MB/s ± 8%     687MB/s ± 8%   +398.57%  (p=0.000 n=19+20)
Index4K-48               94.9MB/s ± 9%  1280.5MB/s ±11%  +1249.11%  (p=0.000 n=19+20)
Index4M-48               94.6MB/s ± 7%  1278.5MB/s ±12%  +1250.99%  (p=0.000 n=19+20)
Index64M-48              94.2MB/s ±10%  1210.9MB/s ± 8%  +1185.04%  (p=0.000 n=19+19)
IndexEasy32-48            608MB/s ±10%    1035MB/s ±10%    +70.15%  (p=0.000 n=20+20)
IndexEasy4K-48           29.3GB/s ± 6%     2.6GB/s ± 6%    -91.24%  (p=0.000 n=19+19)
IndexEasy4M-48           23.3GB/s ±10%     2.5GB/s ± 9%    -89.23%  (p=0.000 n=20+20)
IndexEasy64M-48          7.86GB/s ±11%    2.42GB/s ±14%    -69.18%  (p=0.000 n=19+20)

Change-Id: Ia191f0a6ca80e113397d9ed98d25f195768b65bc
Reviewed-on: https://go-review.googlesource.com/22550
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-09-01 18:05:50 +00:00