commit 034c0e2fc805e8bea53d47351da429d7f57bccf2
author Lukas Geiger <lukas.geiger94@gmail.com> Thu Oct 15 17:29:35 2020 -0700
committer Copybara-Service <copybara-worker@google.com> Thu Oct 15 17:29:57 2020 -0700
tree 3638dcca64e9b3ea76e65854b2e00cb4c50d4fb3
parent e59c55d78f1a041e6a14771254f8e6280804b430
Use movi NEON instruction to zero out registers

Currently `dup` is used to zero out NEON registers in the packing and AArch64 kernel code. According to the [Cortex-A72 optimization guide](https://developer.arm.com/documentation/uan0016/a/) (the Cortex-A72 is the core used in the Raspberry Pi 4), `dup` has an execution latency of 8 cycles and a throughput of 1 when copying from a general-purpose register to a NEON register. This PR changes the code to use `movi`, which has a latency of 3 cycles and a throughput of 2. This is also what [LLVM emits for zeroing out registers](https://github.com/llvm/llvm-project/blob/master/llvm/test/CodeGen/AArch64/arm64-zero-cycle-zeroing.ll), but please let me know if I am missing something here.

I briefly benchmarked this code on a Pixel phone but didn't see any measurable difference, which I think is expected: on the Cortex-A76 used there, `dup` only has a latency of 3 cycles, so this PR won't have a large effect on that architecture anyway.

Closes https://github.com/google/ruy/pull/203

COPYBARA_INTEGRATE_REVIEW=https://github.com/google/ruy/pull/203 from lgeiger:movi-to-zero-neon-register 106c13e330117fdc9cb4a52c1cef7bcce8836017
PiperOrigin-RevId: 337416443
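As a brief illustration of the change (the register `v16` here is an arbitrary choice for the example; the cycle counts are the Cortex-A72 figures cited above), the swap is from a general-purpose-to-vector `dup` to an immediate-form `movi`:

```
// Before: broadcast the zeroed general-purpose register wzr into v16.
// On Cortex-A72 this GPR-to-NEON copy has an 8-cycle latency, throughput 1.
dup v16.4s, wzr

// After: materialize the zero immediate directly in the vector unit.
// On Cortex-A72 this has a 3-cycle latency, throughput 2.
movi v16.16b, #0
```

Both forms leave all 128 bits of the vector register zeroed, so the element arrangement (`.4s` vs `.16b`) does not change the result.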
This is not an officially supported Google product.
ruy is a matrix multiplication library. Its focus is to cover the matrix multiplication needs of neural network inference engines. Its initial user has been TensorFlow Lite, where it is used by default on the ARM CPU architecture.
ruy supports both floating-point and 8-bit-integer-quantized matrices.
ruy is designed to achieve high performance not just on very large sizes, as is the focus of many established libraries, but on the actual sizes and shapes of matrices that are most critical in current TensorFlow Lite applications. This often means quite small sizes, e.g. 100x100 or even 50x50, and all sorts of rectangular shapes. It's not as fast as completely specialized code for each shape, but it aims to offer a good compromise of speed across all shapes and a small binary size.
Some documentation will eventually be available in the doc/ directory; see doc/README.md.