Improve sad3x16 SSE2 function

Vp9_sad3x16_sse2() is heavily called in decoder, in which the
unaligned reads consume lots of cpu cycles. When CONFIG_SUBPELREFMV
is off, the unaligned offset is 1. In this situation,
we can adjust the src_ptr to be 4-byte aligned, and then do the
aligned reads. This reduced the reading time significantly. Tests
on 1080p clip showed over 2% decoder performance gain with
CONFIG_SUBPELREFM off.

Change-Id: I953afe3ac5406107933ef49d0b695eafba9a6507
3 files changed