use ld2.4s/ld4.4s in load64/load128

We can load into two or four adjacent registers from alloc_tmp(N) and
then simply name the right one as our output, discarding the others.
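
As a rough illustration of the idea above, here is a scalar C++ model of
what a de-interleaving ld2.4s/ld4.4s load followed by "name the right one"
amounts to. This is a sketch, not the JIT code in this change: Vec4,
load_deinterleaved, and load_select are hypothetical names, and treating
the selector as the immB lane index is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One 128-bit vector register viewed as four 32-bit lanes (hypothetical type).
using Vec4 = std::array<uint32_t, 4>;

// Scalar model of ld2.4s (N = 2) or ld4.4s (N = 4): read 4*N consecutive
// 32-bit words and de-interleave them, so register r receives words
// r, r+N, r+2N, r+3N.
template <std::size_t N>
std::array<Vec4, N> load_deinterleaved(const uint32_t* src) {
    std::array<Vec4, N> regs{};
    for (std::size_t lane = 0; lane < 4; lane++) {
        for (std::size_t r = 0; r < N; r++) {
            regs[r][lane] = src[lane * N + r];
        }
    }
    return regs;
}

// Hypothetical stand-in for load64 (N = 2) / load128 (N = 4): fill all N
// adjacent registers, then keep only the one the selector names. The other
// N-1 registers are discarded, which is the redundant work the TODO mentions.
template <std::size_t N>
Vec4 load_select(const uint32_t* src, std::size_t selector) {
    assert(selector < N);  // selector is assumed to play the role of immB
    return load_deinterleaved<N>(src)[selector];
}
```

Calling load_select<2>(ptr, 1) on 64-bit elements would yield the high
32 bits of each of the four elements; every call re-reads the same memory,
one call per selector value.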

This still has the major TODO of returning all two/four registers at
once, which would eliminate the immB selector and the redundant memory
traffic. Expect changes towards support for up to 4 Vals per Op next week.

Change-Id: I94846d15ac59d4018c1c9d136c17833e5091f8cf
Reviewed-on: https://skia-review.googlesource.com/c/skia/+/357305
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
1 file changed