Use movi NEON instruction to zero out registers

Currently `dup` is used to zero out NEON registers in the packing and AArch64 kernel code. According to the [Cortex-A72 optimization guide](https://developer.arm.com/documentation/uan0016/a/) (the Cortex-A72 is the core used in the Raspberry Pi 4), `dup` has an execution latency of 8 cycles and a throughput of 1 when copying from a general-purpose register to a NEON register.

This PR changes the code to use `movi`, which has a latency of 3 cycles and a throughput of 2. `movi` is also what [LLVM uses for zeroing out registers](https://github.com/llvm/llvm-project/blob/master/llvm/test/CodeGen/AArch64/arm64-zero-cycle-zeroing.ll), but please let me know if I am missing something here.
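For illustration, this is the shape of the substitution (a sketch, not the exact ruy kernel code; the register `v16` and lane arrangements are arbitrary examples):

```asm
// Before: zero v16 by duplicating the zero GPR into every lane.
// On Cortex-A72 this GPR-to-NEON copy has a latency of 8 cycles
// and a throughput of 1.
dup v16.4s, wzr

// After: zero v16 with a move-immediate, which never touches the
// general-purpose register file. On Cortex-A72 this has a latency
// of 3 cycles and a throughput of 2.
movi v16.16b, #0
```

Both instructions leave the register with all bits zero, so the lane arrangement chosen for `movi` does not matter for this purpose.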

I briefly benchmarked this code on a Pixel phone but didn't see any measurable difference, which I think is expected: on the Cortex-A76 cores used there, `dup` only has a latency of 3 cycles, so this PR won't have a large effect on that architecture anyway.

Closes https://github.com/google/ruy/pull/203

COPYBARA_INTEGRATE_REVIEW=https://github.com/google/ruy/pull/203 from lgeiger:movi-to-zero-neon-register 106c13e330117fdc9cb4a52c1cef7bcce8836017
PiperOrigin-RevId: 337416443

# The ruy matrix multiplication library

This is not an officially supported Google product.

ruy is a matrix multiplication library. Its focus is to cover the matrix multiplication needs of neural network inference engines. Its initial user has been TensorFlow Lite, where it is used by default on the ARM CPU architecture.

ruy supports both floating-point and 8-bit-integer-quantized matrices.

## Efficiency

ruy is designed to achieve high performance not just on very large sizes, as is the focus of many established libraries, but on whatever are the actual sizes and shapes of matrices most critical in current TensorFlow Lite applications. This often means quite small sizes, e.g. 100x100 or even 50x50, and all sorts of rectangular shapes. It's not as fast as completely specialized code for each shape, but it aims to offer a good compromise of speed across all shapes and a small binary size.

## Documentation

Some documentation will eventually be available in the doc/ directory; see doc/README.md.