Ben Shi
aaf73c6d1e
cmd/compile: optimize ARM64 with shifted register indexed load/store
...
ARM64 supports efficient instructions which combine shift, addition, load/store
together. Such as "MOVD (R0)(R1<<3), R2" and "MOVWU R6, (R4)(R1<<2)".
This CL optimizes the compiler to emit such efficient instuctions. And below
is some test data.
1. binary size before/after
binary size change
pkg/linux_arm64 +80.1KB
pkg/tool/linux_arm64 +121.9KB
go -4.3KB
gofmt -64KB
2. go1 benchmark
There is big improvement for the test case Fannkuch11, and slight
improvement for sme others, excluding noise.
name old time/op new time/op delta
BinaryTree17-4 43.9s ± 2% 44.0s ± 2% ~ (p=0.820 n=30+30)
Fannkuch11-4 30.6s ± 2% 24.5s ± 3% -19.93% (p=0.000 n=25+30)
FmtFprintfEmpty-4 500ns ± 0% 499ns ± 0% -0.11% (p=0.000 n=23+25)
FmtFprintfString-4 1.03µs ± 0% 1.04µs ± 3% ~ (p=0.065 n=29+30)
FmtFprintfInt-4 1.15µs ± 3% 1.15µs ± 4% -0.56% (p=0.000 n=30+30)
FmtFprintfIntInt-4 1.80µs ± 5% 1.82µs ± 0% ~ (p=0.094 n=30+24)
FmtFprintfPrefixedInt-4 2.17µs ± 5% 2.20µs ± 0% ~ (p=0.100 n=30+23)
FmtFprintfFloat-4 3.08µs ± 3% 3.09µs ± 4% ~ (p=0.123 n=30+30)
FmtManyArgs-4 7.41µs ± 4% 7.17µs ± 1% -3.26% (p=0.000 n=30+23)
GobDecode-4 93.7ms ± 0% 94.7ms ± 4% ~ (p=0.685 n=24+30)
GobEncode-4 78.7ms ± 7% 77.1ms ± 0% ~ (p=0.729 n=30+23)
Gzip-4 4.01s ± 0% 3.97s ± 5% -1.11% (p=0.037 n=24+30)
Gunzip-4 389ms ± 4% 384ms ± 0% ~ (p=0.155 n=30+23)
HTTPClientServer-4 536µs ± 1% 537µs ± 1% ~ (p=0.236 n=30+30)
JSONEncode-4 179ms ± 1% 182ms ± 6% ~ (p=0.763 n=24+30)
JSONDecode-4 843ms ± 0% 839ms ± 6% -0.42% (p=0.003 n=25+30)
Mandelbrot200-4 46.5ms ± 0% 46.5ms ± 0% +0.02% (p=0.000 n=26+26)
GoParse-4 44.3ms ± 6% 43.3ms ± 0% ~ (p=0.067 n=30+27)
RegexpMatchEasy0_32-4 1.07µs ± 7% 1.07µs ± 4% ~ (p=0.835 n=30+30)
RegexpMatchEasy0_1K-4 5.51µs ± 0% 5.49µs ± 0% -0.35% (p=0.000 n=23+26)
RegexpMatchEasy1_32-4 1.01µs ± 0% 1.02µs ± 4% +0.96% (p=0.014 n=24+30)
RegexpMatchEasy1_1K-4 7.43µs ± 0% 7.18µs ± 0% -3.41% (p=0.000 n=23+24)
RegexpMatchMedium_32-4 1.78µs ± 0% 1.81µs ± 4% +1.47% (p=0.012 n=23+30)
RegexpMatchMedium_1K-4 547µs ± 1% 542µs ± 3% -0.90% (p=0.003 n=24+30)
RegexpMatchHard_32-4 30.4µs ± 0% 29.7µs ± 0% -2.15% (p=0.000 n=19+23)
RegexpMatchHard_1K-4 913µs ± 0% 915µs ± 6% +0.25% (p=0.012 n=24+30)
Revcomp-4 6.32s ± 1% 6.42s ± 4% ~ (p=0.342 n=25+30)
Template-4 868ms ± 6% 878ms ± 6% +1.15% (p=0.000 n=30+30)
TimeParse-4 4.57µs ± 4% 4.59µs ± 3% +0.65% (p=0.010 n=29+30)
TimeFormat-4 4.51µs ± 0% 4.50µs ± 0% -0.27% (p=0.000 n=27+24)
[Geo mean] 695µs 689µs -0.92%
name old speed new speed delta
GobDecode-4 8.19MB/s ± 0% 8.12MB/s ± 4% ~ (p=0.680 n=24+30)
GobEncode-4 9.76MB/s ± 7% 9.96MB/s ± 0% ~ (p=0.616 n=30+23)
Gzip-4 4.84MB/s ± 0% 4.89MB/s ± 4% +1.16% (p=0.030 n=24+30)
Gunzip-4 49.9MB/s ± 4% 50.6MB/s ± 0% ~ (p=0.162 n=30+23)
JSONEncode-4 10.9MB/s ± 1% 10.7MB/s ± 6% ~ (p=0.575 n=24+30)
JSONDecode-4 2.30MB/s ± 0% 2.32MB/s ± 5% +0.72% (p=0.003 n=22+30)
GoParse-4 1.31MB/s ± 6% 1.34MB/s ± 0% +2.26% (p=0.002 n=30+27)
RegexpMatchEasy0_32-4 30.0MB/s ± 6% 30.0MB/s ± 4% ~ (p=1.000 n=30+30)
RegexpMatchEasy0_1K-4 186MB/s ± 0% 187MB/s ± 0% +0.35% (p=0.000 n=23+26)
RegexpMatchEasy1_32-4 31.8MB/s ± 0% 31.5MB/s ± 4% -0.92% (p=0.012 n=25+30)
RegexpMatchEasy1_1K-4 138MB/s ± 0% 143MB/s ± 0% +3.53% (p=0.000 n=23+24)
RegexpMatchMedium_32-4 560kB/s ± 0% 553kB/s ± 4% -1.19% (p=0.005 n=23+30)
RegexpMatchMedium_1K-4 1.87MB/s ± 0% 1.89MB/s ± 3% +1.04% (p=0.002 n=24+30)
RegexpMatchHard_32-4 1.05MB/s ± 0% 1.08MB/s ± 0% +2.40% (p=0.000 n=19+23)
RegexpMatchHard_1K-4 1.12MB/s ± 0% 1.12MB/s ± 5% +0.12% (p=0.006 n=25+30)
Revcomp-4 40.2MB/s ± 1% 39.6MB/s ± 4% ~ (p=0.242 n=25+30)
Template-4 2.24MB/s ± 6% 2.21MB/s ± 6% -1.15% (p=0.000 n=30+30)
[Geo mean] 7.87MB/s 7.91MB/s +0.44%
Change-Id: If374cb7abf83537aa0a176f73c0f736f7800db03
Reviewed-on: https://go-review.googlesource.com/108735
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-04-27 20:02:05 +00:00
Ben Shi
34f5f8a580
cmd/compile: optimize ARM64 with register indexed load/store
...
ARM64 supports load/store instructions with a memory operand that
the address is calculated by base register + index register.
In this CL,
1. Some rules are added to the compile's ARM64 backend to emit
such efficient instructions.
2. A wrong rule of load combination is fixed.
The go1 benchmark does show improvement.
name old time/op new time/op delta
BinaryTree17-4 44.5s ± 2% 44.1s ± 1% -0.81% (p=0.000 n=28+29)
Fannkuch11-4 32.7s ± 3% 30.5s ± 0% -6.79% (p=0.000 n=30+26)
FmtFprintfEmpty-4 499ns ± 0% 506ns ± 5% +1.39% (p=0.003 n=25+30)
FmtFprintfString-4 1.07µs ± 0% 1.04µs ± 4% -3.17% (p=0.000 n=23+30)
FmtFprintfInt-4 1.15µs ± 4% 1.13µs ± 0% -1.55% (p=0.000 n=30+23)
FmtFprintfIntInt-4 1.77µs ± 4% 1.74µs ± 0% -1.71% (p=0.000 n=30+24)
FmtFprintfPrefixedInt-4 2.37µs ± 5% 2.12µs ± 0% -10.56% (p=0.000 n=30+23)
FmtFprintfFloat-4 3.03µs ± 1% 3.03µs ± 4% -0.13% (p=0.003 n=25+30)
FmtManyArgs-4 7.38µs ± 1% 7.43µs ± 4% +0.59% (p=0.003 n=25+30)
GobDecode-4 101ms ± 6% 95ms ± 5% -5.55% (p=0.000 n=30+30)
GobEncode-4 78.0ms ± 4% 78.8ms ± 6% +1.05% (p=0.000 n=30+30)
Gzip-4 4.25s ± 0% 4.27s ± 4% +0.45% (p=0.003 n=24+30)
Gunzip-4 428ms ± 1% 420ms ± 0% -1.88% (p=0.000 n=23+23)
HTTPClientServer-4 549µs ± 1% 541µs ± 1% -1.56% (p=0.000 n=29+29)
JSONEncode-4 194ms ± 0% 188ms ± 4% ~ (p=0.417 n=23+30)
JSONDecode-4 890ms ± 5% 831ms ± 0% -6.55% (p=0.000 n=30+23)
Mandelbrot200-4 47.3ms ± 2% 46.5ms ± 0% ~ (p=0.980 n=30+26)
GoParse-4 43.1ms ± 6% 43.8ms ± 6% +1.65% (p=0.000 n=30+30)
RegexpMatchEasy0_32-4 1.06µs ± 0% 1.07µs ± 3% ~ (p=0.092 n=23+30)
RegexpMatchEasy0_1K-4 5.53µs ± 0% 5.51µs ± 0% -0.24% (p=0.000 n=25+25)
RegexpMatchEasy1_32-4 1.02µs ± 3% 1.01µs ± 0% -1.27% (p=0.000 n=30+24)
RegexpMatchEasy1_1K-4 7.26µs ± 0% 7.33µs ± 0% +0.95% (p=0.000 n=23+26)
RegexpMatchMedium_32-4 1.84µs ± 7% 1.79µs ± 1% ~ (p=0.333 n=30+23)
RegexpMatchMedium_1K-4 553µs ± 0% 547µs ± 0% -1.14% (p=0.000 n=24+22)
RegexpMatchHard_32-4 30.8µs ± 1% 30.3µs ± 0% -1.40% (p=0.000 n=24+24)
RegexpMatchHard_1K-4 928µs ± 0% 929µs ± 5% +0.12% (p=0.013 n=23+30)
Revcomp-4 8.13s ± 4% 6.32s ± 1% -22.23% (p=0.000 n=30+23)
Template-4 899ms ± 6% 854ms ± 1% -5.01% (p=0.000 n=30+24)
TimeParse-4 4.66µs ± 4% 4.59µs ± 1% -1.57% (p=0.000 n=30+23)
TimeFormat-4 4.58µs ± 0% 4.61µs ± 0% +0.57% (p=0.000 n=26+24)
[Geo mean] 717µs 698µs -2.55%
name old speed new speed delta
GobDecode-4 7.63MB/s ± 6% 8.08MB/s ± 5% +5.88% (p=0.000 n=30+30)
GobEncode-4 9.85MB/s ± 4% 9.75MB/s ± 6% -1.04% (p=0.000 n=30+30)
Gzip-4 4.56MB/s ± 0% 4.55MB/s ± 4% -0.36% (p=0.003 n=24+30)
Gunzip-4 45.3MB/s ± 1% 46.2MB/s ± 0% +1.92% (p=0.000 n=23+23)
JSONEncode-4 10.0MB/s ± 0% 10.4MB/s ± 4% ~ (p=0.403 n=23+30)
JSONDecode-4 2.18MB/s ± 5% 2.33MB/s ± 0% +6.91% (p=0.000 n=30+23)
GoParse-4 1.34MB/s ± 5% 1.32MB/s ± 5% -1.66% (p=0.000 n=30+30)
RegexpMatchEasy0_32-4 30.2MB/s ± 0% 29.8MB/s ± 3% ~ (p=0.099 n=23+30)
RegexpMatchEasy0_1K-4 185MB/s ± 0% 186MB/s ± 0% +0.24% (p=0.000 n=25+25)
RegexpMatchEasy1_32-4 31.4MB/s ± 3% 31.8MB/s ± 0% +1.24% (p=0.000 n=30+24)
RegexpMatchEasy1_1K-4 141MB/s ± 0% 140MB/s ± 0% -0.94% (p=0.000 n=23+26)
RegexpMatchMedium_32-4 541kB/s ± 6% 560kB/s ± 0% +3.45% (p=0.000 n=30+23)
RegexpMatchMedium_1K-4 1.85MB/s ± 0% 1.87MB/s ± 0% +1.08% (p=0.000 n=24+23)
RegexpMatchHard_32-4 1.04MB/s ± 1% 1.06MB/s ± 1% +1.48% (p=0.000 n=24+24)
RegexpMatchHard_1K-4 1.10MB/s ± 0% 1.10MB/s ± 5% +0.15% (p=0.004 n=23+30)
Revcomp-4 31.3MB/s ± 4% 40.2MB/s ± 1% +28.52% (p=0.000 n=30+23)
Template-4 2.16MB/s ± 6% 2.27MB/s ± 1% +5.18% (p=0.000 n=30+24)
[Geo mean] 7.57MB/s 7.79MB/s +2.98%
fixes #24907
Change-Id: I94afd0e3f53d62a1cf5e452f3dd6daf61be21785
Reviewed-on: https://go-review.googlesource.com/107376
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-04-19 15:08:10 +00:00
Alberto Donizetti
467eca6076
test/codegen: port last stack and memcombining tests
...
And delete them from asm_test.
Also delete an arm64 cmov test has been already ported to the new test
harness.
Change-Id: I4458721e1f512bc9ecbbe1c22a2c9c7109ad68fe
Reviewed-on: https://go-review.googlesource.com/106335
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
2018-04-11 16:08:04 +00:00
Alberto Donizetti
54c3f56ee0
test/codegen: port various mem-combining tests
...
And delete them from asm_test.
Change-Id: I0e33d58274951ab5acb67b0117b60ef617ea887a
Reviewed-on: https://go-review.googlesource.com/105735
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
2018-04-09 12:00:06 +00:00
Alberto Donizetti
3e31eb6b84
test/codegen: port arm64 slice zeroing tests
...
Finish porting arm64 slice zeroing codegen tests; delete them from
asm_test.
Change-Id: Id2532df8ba1c340fa662a6b5238daa3de30548be
Reviewed-on: https://go-review.googlesource.com/105136
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
2018-04-07 09:55:51 +00:00
Alberto Donizetti
f2abca90a2
test/codegen: port arm64 byte slice zeroing tests
...
And delete them from asm_test.
Change-Id: Id533130470da9176a401cb94972f626f43a62148
Reviewed-on: https://go-review.googlesource.com/103656
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
2018-04-04 13:18:15 +00:00
Wei Xiao
05962561ae
cmd/compile/internal/ssa: improve store combine optimization on arm64
...
Current implementation doesn't consider MOVDreg type operand and fail to combine
it into larger store. This patch fixes the issue.
Fixes #24242
Change-Id: I7d68697f80e76f48c3528ece01a602bf513248ec
Reviewed-on: https://go-review.googlesource.com/98397
Run-TryBot: Giovanni Bajo <rasky@develer.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-03-06 20:29:04 +00:00
Giovanni Bajo
fad31e513d
test: move load/store combines into asmcheck
...
This CL moves the load/store combining tests into asmcheck.
In addition at being more compact, it's also now easier to
spot what it is missing in each architecture.
While doing so, I think I uncovered a bug in ppc64le and arm64
rules, because they fail to load/store combine in non-trivial
functions. Not sure why, I'll open an issue.
Change-Id: Ia1572d53c0553d9104f3e52b95e4d1768a8440a3
Reviewed-on: https://go-review.googlesource.com/98441
Run-TryBot: Giovanni Bajo <rasky@develer.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2018-03-04 16:52:03 +00:00