Use aligned buffer operations in 8x8/16x16 2D-DCT

This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles (about 3%).

Change-Id: I137758b81cd127b936175284310e81378db64552
1 file changed