2x, opt level -O2
old: 0.42 of the C function's runtime
new: 0.36 of the C function's runtime
4x, opt level -O2
old: 0.30 of the C function's runtime
new: 0.23 of the C function's runtime
Arm NEON performance is much better when more data can be loaded and
processed at once.
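
As an illustration of what a "4x" variant can look like, here is a minimal
sketch. The kernel (element-wise multiply) and the name mul_arrays_4x are
hypothetical and are not the functions measured in these notes. It loads
four quad registers per loop iteration so several independent operations
are in flight, and falls back to plain C when NEON is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical kernel: dst[i] = a[i] * b[i] for n floats. */
void mul_arrays_4x(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && a != NULL && b != NULL);

#if HAVE_NEON
    /* "4x": four quad registers (16 floats) per iteration. The four
     * multiplies do not depend on each other, so they can overlap. */
    for (; i + 16 <= n; i += 16) {
        float32x4_t a0 = vld1q_f32(a + i);
        float32x4_t a1 = vld1q_f32(a + i + 4);
        float32x4_t a2 = vld1q_f32(a + i + 8);
        float32x4_t a3 = vld1q_f32(a + i + 12);
        float32x4_t b0 = vld1q_f32(b + i);
        float32x4_t b1 = vld1q_f32(b + i + 4);
        float32x4_t b2 = vld1q_f32(b + i + 8);
        float32x4_t b3 = vld1q_f32(b + i + 12);

        vst1q_f32(dst + i,      vmulq_f32(a0, b0));
        vst1q_f32(dst + i + 4,  vmulq_f32(a1, b1));
        vst1q_f32(dst + i + 8,  vmulq_f32(a2, b2));
        vst1q_f32(dst + i + 12, vmulq_f32(a3, b3));
    }
#endif

    /* Scalar tail; also the whole loop when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

A "2x" variant would use the same pattern with two quad registers (8
floats) per iteration.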
Four vectors. Opt level -O3
---------------------------
runtime 100% - C function
runtime 99% - Neon function
runtime 56% - Neon 2x function
runtime 36% - Neon 4x function
Four vectors. Opt level -O2
---------------------------
runtime 100% - C function
runtime 71% - Neon function
runtime 43% - Neon 2x function
runtime 30% - Neon 4x function
Compared to the C function, the NEON DotProduct runs slower once
optimization is enabled (factor relative to the C runtime):
-O0 factor 0.86
-O1 factor 1.60
-O2 factor 1.59
-O3 factor 1.57
Six values and three multiply/adds are not enough work to fill even two
quad registers and hide the NEON latency.
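
To make the overhead concrete, here is a sketch of a three-component dot
product in plain C next to one possible NEON formulation. The names dot3_c
and dot3_neon are hypothetical; the DotProduct code measured above is not
shown in these notes and may differ. The NEON path has to assemble padded
vectors, do a single multiply, then reduce across lanes and move the result
back to a scalar, so most of its work is setup and teardown.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Plain C reference: three multiplies, two adds. */
float dot3_c(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* One possible NEON version (assumed, for illustration only). */
float dot3_neon(const float a[3], const float b[3])
{
#if HAVE_NEON
    /* Load two floats, pad the upper half with zeros, then insert the
     * third component. A 4-float load would read past the array end. */
    float32x4_t va = vcombine_f32(vld1_f32(a), vdup_n_f32(0.0f));
    float32x4_t vb = vcombine_f32(vld1_f32(b), vdup_n_f32(0.0f));
    va = vsetq_lane_f32(a[2], va, 2);
    vb = vsetq_lane_f32(b[2], vb, 2);

    /* One vector multiply; the fourth lane is 0 * 0. */
    float32x4_t prod = vmulq_f32(va, vb);

    /* Horizontal reduction: add high half to low half, then pairwise. */
    float32x2_t sum = vadd_f32(vget_low_f32(prod), vget_high_f32(prod));
    sum = vpadd_f32(sum, sum);

    return vget_lane_f32(sum, 0);
#else
    /* No NEON on this target: fall back to the scalar version. */
    return dot3_c(a, b);
#endif
}
```

Only one of the instructions in the NEON path does the arithmetic the C
version needs; the rest moves data into and out of vector registers, and
each step depends on the previous one, so there is nothing to overlap.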