Optimize transpose_neon.h helper functions

1) Use vtrn[12]q_[su]64 in the vpx_vtrnq_[su]64* helpers on AArch64
   targets. This emits half as many instructions: the number of
   TRN1/TRN2 instructions is half the number of MOVs that result from
   vcombine.

2) Use vpx_vtrnq_[su]64* helpers wherever applicable.

3) Refactor transpose_4x8_s16 to operate on 128-bit vectors.
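
For illustration of item 1, a minimal sketch of the two code paths. The
helper name sketch_trn_s64 and its array-based signature are hypothetical
and are not the functions in transpose_neon.h; the scalar fallback exists
only so the sketch builds on non-Arm hosts.

```c
#include <stdint.h>

#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper, not the libvpx one.
 * Computes out0 = { a[0], b[0] } and out1 = { a[1], b[1] }. */
static void sketch_trn_s64(const int64_t a[2], const int64_t b[2],
                           int64_t out0[2], int64_t out1[2]) {
#if defined(__aarch64__)
  /* AArch64: one TRN1 and one TRN2 on 64-bit lanes. */
  const int64x2_t va = vld1q_s64(a);
  const int64x2_t vb = vld1q_s64(b);
  vst1q_s64(out0, vtrn1q_s64(va, vb));
  vst1q_s64(out1, vtrn2q_s64(va, vb));
#elif defined(__ARM_NEON)
  /* Armv7 NEON: combine the low halves, then the high halves. */
  const int64x2_t va = vld1q_s64(a);
  const int64x2_t vb = vld1q_s64(b);
  vst1q_s64(out0, vcombine_s64(vget_low_s64(va), vget_low_s64(vb)));
  vst1q_s64(out1, vcombine_s64(vget_high_s64(va), vget_high_s64(vb)));
#else
  /* Scalar model of the same lane shuffle. */
  out0[0] = a[0];
  out0[1] = b[0];
  out1[0] = a[1];
  out1[1] = b[1];
#endif
}
```

All three branches produce the same lanes; only the instructions the
compiler emits differ.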

Change-Id: I9a8b1c1fe2a98a429e0c5f39def5eb2f65759127
1 file changed